---
title: 'An Introduction to corrplot Package'
author: 'Taiyun Wei, Viliam Simko'
date: '`r Sys.Date()`'
output:
  prettydoc::html_pretty:
    theme: cayman
    toc: true
toc-title: 'Table of Contents'
vignette: >
  %\VignetteIndexEntry{An Introduction to corrplot Package}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include=FALSE}

knitr::opts_chunk$set(
  fig.align = 'center',
  fig.path = 'webimg/',
  fig.width = 7,
  fig.height = 7,
  out.width = '600px',
  dev = 'png')

get_os = function() {
  sysinf = Sys.info()
  if (!is.null(sysinf)) {
    os = sysinf['sysname']
    if (os == 'Darwin')
      os = 'osx'
  } else { ## mystery machine
    os = .Platform$OS.type
    if (grepl('^darwin', R.version$os))
      os = 'osx'
    if (grepl('linux-gnu', R.version$os))
      os = 'linux'
  }
  tolower(os)
}
if ((get_os() == 'windows' && capabilities('cairo')) || all(capabilities(c('cairo', 'X11')))) {
  knitr::opts_chunk$set(dev.args = list(type='cairo'))
}

```

## Introduction

The R package **corrplot** provides a visual exploratory tool for correlation matrices.
It supports automatic variable reordering to help detect hidden patterns among variables.

corrplot is easy to use and offers a rich set of plotting options for the
visualization method, graphic layout, colors, legend, text labels, etc.
It can also display p-values and confidence intervals to help users judge the
statistical significance of the correlations.


`corrplot()` has about 50 parameters, but only a few of them are commonly used.
In most cases, a correlation matrix plot takes a single line of code.

The most commonly used parameters include `method`, `type`, `order` and `diag`.

There are seven visualization methods (parameter `method`) in the
corrplot package: `'circle'`, `'square'`, `'ellipse'`,
`'number'`, `'shade'`, `'color'` and `'pie'`. With the default color setting,
the color intensity of each glyph is proportional to the correlation coefficient.

- `'circle'` and `'square'`: the **areas** of the circles or squares show the
absolute values of the corresponding correlation coefficients.

- `'ellipse'`: the eccentricity of each ellipse is parametrically scaled to the correlation value.
This method comes from the work of D.J. Murdoch and E.D. Chow; see the References section.

- `'number'`: the coefficients are printed as numbers in different colors.

- `'color'`: squares of equal size in different colors.

- `'shade'`: similar to `'color'`, but the glyphs of negative coefficients are shaded.
The `'pie'` and `'shade'` methods come from the work of Michael Friendly.

- `'pie'`: the circles are filled clockwise for positive values and anti-clockwise for negative
values.





`corrplot.mixed()` is a wrapper function for mixed visualization styles;
it sets the visualization methods of the lower and upper triangles
separately.

There are three layout types (parameter `type`): `'full'`, `'upper'` and
`'lower'`.

The correlation matrix can be reordered according to the correlation
coefficients. This helps to identify hidden structure
and patterns in the matrix.

```{r intro}
library(corrplot)
M = cor(mtcars)
corrplot(M, method = 'number') # colorful number
corrplot(M, method = 'color', order = 'alphabet')
corrplot(M) # by default, method = 'circle'
corrplot(M, order = 'AOE') # after 'AOE' reorder
corrplot(M, method = 'shade', order = 'AOE', diag = FALSE)
corrplot(M, method = 'square', order = 'FPC', type = 'lower', diag = FALSE)
corrplot(M, method = 'ellipse', order = 'AOE', type = 'upper')
corrplot.mixed(M, order = 'AOE')
corrplot.mixed(M, lower = 'shade', upper = 'pie', order = 'hclust')
```

## Reorder a correlation matrix

The four `order` algorithms, `'AOE'`, `'FPC'`,
`'hclust'` and `'alphabet'`, work as follows.

-   `'AOE'` is for the angular order of the eigenvectors. It is
    calculated from the order of the angles $a_i$,

    $$
    a_i = 
    \begin{cases}
                \arctan (e_{i2}/e_{i1}), & \text{if $e_{i1}>0$;}
                 \newline
                \arctan (e_{i2}/e_{i1}) + \pi, & \text{otherwise.}
    \end{cases}         
    $$

    where $e_{i1}$ and $e_{i2}$ are the $i$-th elements of the eigenvectors
    $e_1$ and $e_2$ associated with the two largest eigenvalues of the
    correlation matrix. See [Michael Friendly
    (2002)](http://www.datavis.ca/papers/corrgram.pdf) for details.

-   `'FPC'` for the first principal component order.

-   `'hclust'` for hierarchical clustering order, with the parameter
    `hclust.method` selecting the agglomeration method. `hclust.method` should be
    one of `'ward'`, `'ward.D'`, `'ward.D2'`, `'single'`, `'complete'`, 
    `'average'`, `'mcquitty'`, `'median'` or `'centroid'`.

-   `'alphabet'` for alphabetical order.
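
As a rough illustration of the `'AOE'` and `'FPC'` rules above, both orders can
be computed by hand from the eigen decomposition. This is a sketch in base R,
not the package's own code; the names `C_demo`, `angle`, `aoe_order` and
`fpc_order` are made up for the example. The sign of an eigenvector is
arbitrary, so another implementation may return a reversed or rotated order.

```{r aoe-sketch}
C_demo = cor(mtcars)

# eigenvectors belonging to the two largest eigenvalues
# (eigen() returns the eigenvalues in decreasing order)
ev = eigen(C_demo)$vectors[, 1:2]
e1 = ev[, 1]
e2 = ev[, 2]

# angle a_i of each variable in the plane spanned by e1 and e2
angle = ifelse(e1 > 0, atan(e2 / e1), atan(e2 / e1) + pi)

aoe_order = order(angle)  # angular order of the eigenvectors
fpc_order = order(e1)     # order along the first principal component (assumed reading of 'FPC')

colnames(C_demo)[aoe_order]
```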


You can also reorder the matrix 'manually' via function
`corrMatOrder()`.
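
A minimal sketch of manual reordering with `corrMatOrder()`: the order is
computed first and then applied to both rows and columns. The names `C_demo`,
`ord` and `C_sorted` are illustrative, and the choice of `'average'` linkage is
arbitrary.

```{r manual-order}
library(corrplot)
C_demo = cor(mtcars)

# compute the order first, then permute rows and columns ourselves
ord = corrMatOrder(C_demo, order = 'hclust', hclust.method = 'average')
C_sorted = C_demo[ord, ord]

# the matrix is already sorted, so no 'order' argument is needed here
corrplot(C_sorted, method = 'ellipse')
```

Doing the permutation by hand is useful when the same order has to be reused,
for example to sort a matching matrix of p-values in the same way.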

If using `'hclust'`, `corrplot()` can draw rectangles around the plot of
correlation matrix based on the results of hierarchical clustering.

```{r hclust}
corrplot(M, order = 'hclust', addrect = 2)
corrplot(M, method = 'square', diag = FALSE, order = 'hclust',
         addrect = 3, rect.col = 'blue', rect.lwd = 3, tl.pos = 'd')
```
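
To see which variables end up in which rectangle, the clustering can be
reproduced in base R. The sketch below assumes the dissimilarity `1 - r` and
complete linkage; the package may differ in detail, and the names `hc` and
`groups` are only for illustration.

```{r hclust-sketch}
C_demo = cor(mtcars)

# turn correlations into dissimilarities, cluster, and cut into 2 groups
hc = hclust(as.dist(1 - C_demo), method = 'complete')
groups = cutree(hc, k = 2)

# variables listed by cluster, in dendrogram order
split(names(groups)[hc$order], groups[hc$order])
```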

The R package **seriation** provides infrastructure for ordering objects, with
implementations of several seriation/sequencing/ordination techniques to reorder
matrices, dissimilarity matrices, and dendrograms. For more information,
see the References section.

We can reorder the matrix with the **seriation** package and then plot it with `corrplot()`.
Here are some examples.



```{r seriation}
library(seriation)
list_seriation_methods('matrix')
list_seriation_methods('dist')

data(Zoo)
Z = cor(Zoo[, -c(15, 17)])

dist2order = function(corr, method, ...) {
  d_corr = as.dist(1 - corr)
  s = seriate(d_corr, method = method, ...)
  i = get_order(s)
  return(i)
}
```

The methods `'PCA_angle'` and `'HC'` in **seriation** are the same as `'AOE'` and `'hclust'`,
respectively, in `corrplot()` and `corrMatOrder()`.

Here are some plots after seriation.

```{r seriation-plot}
# Fast Optimal Leaf Ordering for Hierarchical Clustering
i = dist2order(Z, 'OLO')
corrplot(Z[i, i], cl.pos = 'n')

# Quadratic Assignment Problem
i = dist2order(Z, 'QAP_2SUM')
corrplot(Z[i, i], cl.pos = 'n')

# Multidimensional Scaling
i = dist2order(Z, 'MDS_nonmetric')
corrplot(Z[i, i], cl.pos = 'n')

# Simulated annealing
i = dist2order(Z, 'ARSA')
corrplot(Z[i, i], cl.pos = 'n')

# TSP solver
i = dist2order(Z, 'TSP')
corrplot(Z[i, i], cl.pos = 'n')

# Spectral seriation
i = dist2order(Z, 'Spectral')
corrplot(Z[i, i], cl.pos = 'n')
```

`corrRect()` adds rectangles to the plot after `corrplot()` in one of three ways
(parameters `index`, `name` and `namesMat`).
The pipe operator `%>%` from the `magrittr` package makes this more convenient.
Since R 4.1.0, the native pipe `|>` is available without an extra package.

```{r rectangles}
library(magrittr)

# Rank-two ellipse seriation, use index parameter
i = dist2order(Z, 'R2E')
corrplot(Z[i, i], cl.pos = 'n') %>% corrRect(c(1, 9, 15))

# use name parameter
# Since R 4.1.0, the following one line code works:
# corrplot(M, order = 'AOE') |> corrRect(name = c('gear', 'wt', 'carb'))
corrplot(Z, order = 'AOE') %>%
  corrRect(name = c('tail', 'airborne', 'venomous', 'predator'))


# use namesMat parameter
r = rbind(c('eggs', 'catsize', 'airborne', 'milk'),
          c('catsize', 'eggs', 'milk', 'airborne'))
corrplot(Z, order = 'hclust') %>% corrRect(namesMat = r)
```
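
The pipe only passes the value returned by `corrplot()` on as the first
argument of `corrRect()`. A sketch of the same call without any pipe, using the
illustrative names `C_demo` and `res`:

```{r rect-no-pipe}
library(corrplot)
C_demo = cor(mtcars)

# corrplot() returns its result invisibly; keep it in a variable
res = corrplot(C_demo, order = 'AOE')

# same idea as: corrplot(C_demo, order = 'AOE') %>% corrRect(name = ...)
corrRect(res, name = c('gear', 'wt', 'carb'))
```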

## Change color spectra, color-legend and text-legend

We can get sequential and diverging colors from `COL1()` and `COL2()`.
The color palettes are borrowed from the `RColorBrewer` package.

**Notice**: the middle color returned by `COL2()` is fixed to `'#FFFFFF'` (white),
so elements equal to 0 are drawn in white.

- `COL1()`: sequential colors, suitable for visualizing a non-negative or
non-positive matrix (e.g. a matrix in [0, 20], [-100, -10] or [100, 500]).
- `COL2()`: diverging colors, suitable for visualizing a matrix whose elements
are partly positive and partly negative (e.g. a correlation matrix in [-1, 1], or a matrix in [-20, 100]).


The colors of the correlation plot can be customized with the `col` parameter of `corrplot()`,
together with `col.lim` and `is.corr`:

- `col`: vector, the colors of the glyphs. They are distributed uniformly over the `col.lim` interval. By default,
  - if `is.corr` is `TRUE`, `col` will be `COL2('RdBu', 200)`;
  - if `is.corr` is `FALSE`,
    - and `corr` is a non-negative or non-positive matrix, `col` will be `COL1('YlOrBr', 200)`;
    - otherwise (elements are partly positive and partly negative), `col` will be `COL2('RdBu', 200)`.
- `col.lim`: the limits (x1, x2) of the interval over which the colors in `col` are assigned. By default,
  - `col.lim` will be `c(-1, 1)` when `is.corr` is `TRUE`,
  - `col.lim` will be `c(min(corr), max(corr))` when `is.corr` is `FALSE`.
  - **NOTICE**: if you set `col.lim` when `is.corr` is `TRUE`, the colors are still
  distributed uniformly over [-1, 1]; `col.lim` only affects what the color legend displays.
- `is.corr`: logical, whether the input matrix is a correlation matrix. The default value is `TRUE`.
We can visualize a non-correlation matrix by setting `is.corr = FALSE`.
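
To make "distributed uniformly over `col.lim`" concrete, here is a base R
sketch that assigns one of `length(col)` equal-width bins to each value. The
helper `value_to_color()` and the palette `pal_demo` are hypothetical and only
illustrate the idea; they are not the code used inside `corrplot()`.

```{r col-lim-sketch}
# a stand-in diverging palette with 200 colors
pal_demo = colorRampPalette(c('blue', 'white', 'red'))(200)

value_to_color = function(x, col, col.lim) {
  # rescale x to [0, 1], then pick one of length(col) equal-width bins
  u = (x - col.lim[1]) / diff(col.lim)
  idx = pmin(pmax(ceiling(u * length(col)), 1), length(col))
  col[idx]
}

# the lower limit, the midpoint and the upper limit
value_to_color(c(-1, 0, 1), pal_demo, col.lim = c(-1, 1))
```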

All diverging colors from `COL2()` and sequential colors from `COL1()` are shown below.

**Diverging colors**:

```{r echo=FALSE,  fig.width = 8, fig.height = 6, out.width = '700px'}
## diverging colors
plot.new()
par(mar = c(0, 0, 0, 0) + 0.1)
plot.window(xlim = c(-0.2, 1.1), ylim = c(0, 1), xaxs = 'i', yaxs = 'i')

col = c('RdBu', 'BrBG', 'PiYG', 'PRGn', 'PuOr', 'RdYlBu')

for(i in 1:length(col)) {
  colorlegend(COL2(col[i]), -10:10/10, align = 'l', cex = 0.8, xlim = c(0, 1),
              ylim = c(i/length(col)-0.1, i/length(col)), vertical = FALSE)
  text(-0.01, i/length(col)-0.02, col[i], adj = 0.5, pos = 2, cex = 0.8)
}
```

**Sequential colors**:


```{r echo=FALSE,  fig.width = 8, fig.height = 6, out.width = '700px'}
## sequential colors
plot.new()
par(mar = c(0, 0, 0, 0) + 0.1)
plot.window(xlim = c(-0.2, 1.1), ylim = c(0, 1), xaxs = 'i', yaxs = 'i')

col = c('Oranges', 'Purples', 'Reds', 'Blues', 'Greens', 'Greys', 'OrRd',
        'YlOrRd', 'YlOrBr', 'YlGn')

for(i in 1:length(col)) {
  colorlegend(COL1(col[i]), 0:10, align = 'l', cex = 0.8, xlim = c(0, 1),
              ylim = c(i/length(col)-0.1, i/length(col)), vertical = FALSE)
  text(-0.01, i/length(col)-0.02, col[i], adj = 0.5, pos = 2)
}
```

Usage of `COL1()` and `COL2()`:

```{r eval=FALSE}
COL1(sequential = c("Oranges", "Purples", "Reds", "Blues", "Greens", 
                    "Greys", "OrRd", "YlOrRd", "YlOrBr", "YlGn"), n = 200)

COL2(diverging = c("RdBu", "BrBG", "PiYG", "PRGn", "PuOr", "RdYlBu"), n = 200)
```


In addition, the function `colorRampPalette()` is very convenient for generating a color spectrum.
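
A short sketch of a custom spectrum built with `colorRampPalette()` and passed
to `col`. The three anchor colors and the names `my_ramp` and `my_cols` are
arbitrary choices for the example.

```{r ramp}
library(corrplot)

# colorRampPalette() returns a function; call it with the number of colors
my_ramp = colorRampPalette(c('navy', 'white', 'firebrick'))
my_cols = my_ramp(200)

corrplot(cor(mtcars), order = 'AOE', col = my_cols)
```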


The `cl.*` group of parameters controls the color legend. The most commonly used are:

-   `cl.pos` is the position of the color legend. It is character or
    logical. If character, it must be one of `'r'` (right, default
    if `type='upper'` or `'full'`), `'b'` (bottom, default if
    `type='lower'`) or `'n'` (don't draw the color legend).
-   `cl.ratio` adjusts the width of the color legend; 0.1\~0.2 is
    suggested.

The `tl.*` group of parameters controls the text legend. The most commonly used are:

-   `tl.pos` is the position of the text labels. It is character or
    logical. If character, it must be one of `'lt'`, `'ld'`, `'td'`,
    `'d'`, `'l'` or `'n'`. `'lt'` (default if `type='full'`) means left and top,
    `'ld'` (default if `type='lower'`) means left and diagonal,
    `'td'` (default if `type='upper'`) means top and diagonal (near),
    `'d'` means diagonal, `'l'` means left, `'n'` means don't add text labels.
-   `tl.cex` is the size of the text labels (variable names).
-   `tl.srt` is the rotation of the text labels in degrees.

```{r color}
corrplot(M, order = 'AOE', col = COL2('RdBu', 10))
         
corrplot(M, order = 'AOE', addCoef.col = 'black', tl.pos = 'd',
         cl.pos = 'n', col = COL2('PiYG'))

corrplot(M, method = 'square', order = 'AOE', addCoef.col = 'black', tl.pos = 'd',
         cl.pos = 'n', col = COL2('BrBG'))

## bottom color legend, diagonal text legend, rotate text label
corrplot(M, order = 'AOE', cl.pos = 'b', tl.pos = 'd',
         col = COL2('PRGn'), diag = FALSE)

## text labels rotated 45 degrees and a wider color legend
corrplot(M, type = 'lower', order = 'hclust', tl.col = 'black',
         cl.ratio = 0.2, tl.srt = 45, col = COL2('PuOr', 10))

## remove color legend and text legend, two-color glyphs on a gold background
corrplot(M, order = 'AOE', cl.pos = 'n', tl.pos = 'n',
         col = c('white', 'black'), bg = 'gold2')
```

## Visualize non-correlation matrix, NA value and math label

We can visualize a non-correlation matrix by setting `is.corr = FALSE` and
assign colors with `col.lim`. If the matrix has both positive and
negative values, the matrix transformation keeps the sign of every value
by default (see `transKeepSign` in the examples below).
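
The difference between a sign-preserving rescale and a plain shift-and-scale
can be illustrated with a few numbers. This is only a sketch of the idea in
base R, under the assumption that values are mapped into a unit range before
colors are assigned; `vals`, `lim`, `keep_sign` and `shifted` are illustrative
names, not package internals.

```{r keep-sign-sketch}
vals = c(-15, -5, 0, 4, 10)
lim = c(-15, 10)

# sign-preserving: divide by the largest absolute limit, so 0 stays at 0
keep_sign = vals / max(abs(lim))

# shift-and-scale: map lim[1] to 0 and lim[2] to 1, the signs are lost
shifted = (vals - lim[1]) / diff(lim)

rbind(vals, keep_sign, shifted)
```

With the sign kept, zero sits in the middle of a diverging palette; after a
shift, a sequential palette is the more natural choice.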

If your matrix is rectangular, you can adjust the aspect ratio with the
`win.asp` parameter so that the matrix is rendered as a square.

```{r non-corr}
## matrix in [20, 26], grid.col
N1 = matrix(runif(80, 20, 26), 8)
corrplot(N1, is.corr = FALSE, col.lim = c(20, 30), method = 'color', tl.pos = 'n',
         col = COL1('YlGn'), cl.pos = 'b', addgrid.col = 'white', addCoef.col = 'grey50')


## matrix in [-15, 10]
N2 = matrix(runif(80, -15, 10), 8)

## using sequential colors, transKeepSign = FALSE
corrplot(N2, is.corr = FALSE, transKeepSign = FALSE, method = 'color', col.lim = c(-15, 10), 
         tl.pos = 'n', col = COL1('YlGn'), cl.pos = 'b', addCoef.col = 'grey50')

## using diverging colors, transKeepSign = TRUE (default)
corrplot(N2, is.corr = FALSE, col.lim = c(-15, 10), 
         tl.pos = 'n', col = COL2('PiYG'), cl.pos = 'b', addCoef.col = 'grey50')

## using diverging colors
corrplot(N2, is.corr = FALSE, method = 'color', col.lim = c(-15, 10), tl.pos = 'n',
         col = COL2('PiYG'), cl.pos = 'b', addCoef.col = 'grey50')
```
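
A hedged sketch of the `win.asp` parameter mentioned above, using a wide
rectangular matrix. The matrix `N_wide` and the value `0.5` are arbitrary
choices for illustration; a suitable value depends on the shape of your matrix,
and the sketch sticks to `method = 'square'` on the assumption that not every
method supports a non-default aspect ratio.

```{r win-asp}
library(corrplot)
set.seed(1)

# a 4 x 12 matrix: three times as wide as it is tall
N_wide = matrix(runif(48, 0, 10), nrow = 4)

# win.asp changes the aspect ratio of the whole plot window
corrplot(N_wide, is.corr = FALSE, method = 'square', win.asp = 0.5,
         tl.pos = 'n', cl.pos = 'n', col = COL1('Blues'))
```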

Notice: when `is.corr` is `TRUE`, `col.lim` only affects the color legend. If
you change it, the colors on the correlation matrix plot are still assigned over
`c(-1, 1)`.

```{r col-lim}
# when is.corr=TRUE, col.lim only affects the color legend display
corrplot(M/2)
corrplot(M/2, col.lim=c(-0.5, 0.5))
```

By default, **corrplot** renders NA values as `'?'` characters. The
`na.label` parameter sets a different label (at most two
characters are supported).

Since version `0.78`, it is possible to use
[plotmath](https://www.rdocumentation.org/packages/grDevices/topics/plotmath)
expressions in variable names. To activate plotmath rendering, prefix
your label with `'$'`.

```{r NA-math}
M2 = M
diag(M2) = NA
colnames(M2) = rep(c('$alpha+beta', '$alpha[0]', '$alpha[beta]'),
                   c(4, 4, 3))
rownames(M2) = rep(c('$Sigma[i]^n', '$sigma',  '$alpha[0]^100', '$alpha[beta]'),
                   c(2, 4, 2, 3))
corrplot(10*abs(M2), is.corr = FALSE, col.lim = c(0, 10), tl.cex = 1.5)
```
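
The chunk above keeps the default `'?'` label. A minimal sketch of `na.label`
with a custom two-character label follows; `C_na` is an illustrative name, and
the call mirrors the `is.corr = FALSE` pattern used above.

```{r na-label-demo}
library(corrplot)
C_na = cor(mtcars)
diag(C_na) = NA

# replace the default '?' with a two-character label
corrplot(abs(C_na), is.corr = FALSE, col.lim = c(0, 1), na.label = 'NA')
```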

## Visualize p-value and confidence interval

`corrplot()` can also visualize p-values and confidence intervals on the
correlation matrix plot. Here are some important parameters.

About p-values:

-   `p.mat` is the matrix of p-values. If it is `NULL`, the parameters `sig.level`,
    `insig`, `pch`, `pch.col` and `pch.cex` are ignored.
-   `sig.level` is the significance level, with default value 0.05. If a
    p-value in `p.mat` is larger than `sig.level`, the
    corresponding correlation coefficient is regarded as insignificant.
    If `insig` is `'label_sig'`, `sig.level` can be an increasing vector
    of significance levels, in which case `pch` will be used once for
    the highest p-value interval and multiple times (e.g. `'*'`, `'**'`,
    `'***'`) for each lower p-value interval.
-   `insig` is a character string that specifies how insignificant correlation
    coefficients are treated: `'pch'` (default), `'p-value'`, `'blank'`, `'n'` or
    `'label_sig'`. If `'blank'`, wipe away the corresponding glyphs; if
    `'p-value'`, add p-values on the corresponding glyphs; if `'pch'`, add
    characters (see `pch` for details) on the corresponding glyphs; if `'n'`,
    don't take any measures; if `'label_sig'`, mark significant
    correlations with `pch` (see `sig.level`).
-   `pch` is the character added on the glyphs of insignificant
    correlation coefficients (only valid when `insig` is `'pch'`). See
    `?par`.
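
The `'label_sig'` rule can be mimicked in base R to see how an increasing
`sig.level` vector turns into one, two or three marks. The helper
`p_to_stars()` is hypothetical, and treating the boundaries as inclusive is an
assumption of this sketch.

```{r stars-sketch}
p_to_stars = function(p, sig.level = c(0.001, 0.01, 0.05), pch = '*') {
  # count how many of the levels each p-value falls under
  n_marks = rowSums(outer(p, sig.level, '<='))
  strrep(pch, n_marks)
}

p_to_stars(c(0.0004, 0.003, 0.02, 0.2))
```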

About confidence intervals:

-   `plotCI` is a character string giving the method of plotting confidence
    intervals. If `'n'`, don't plot confidence intervals. If `'rect'`,
    plot rectangles whose upper side marks the upper bound and whose lower side
    marks the lower bound.
-   `lowCI.mat` is the matrix of the lower bounds of the confidence intervals.
-   `uppCI.mat` is the matrix of the upper bounds of the confidence intervals.

We can get the p-value matrix and the confidence interval matrices from
`cor.mtest()`, which returns a list containing:

-   `p`: the matrix of p-values.
-   `lowCI`: the matrix of lower bounds of the confidence intervals.
-   `uppCI`: the matrix of upper bounds of the confidence intervals.
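
For intuition, a p-value matrix of this kind can be assembled in base R by
running `cor.test()` on every pair of columns. The function `pairwise_p()` is a
hypothetical helper written for this sketch, not the implementation of
`cor.mtest()`; setting the diagonal to 0 is a convention chosen here.

```{r mtest-sketch}
pairwise_p = function(dat, conf.level = 0.95) {
  dat = as.matrix(dat)
  k = ncol(dat)
  # diagonal stays 0: a variable is trivially correlated with itself
  p = matrix(0, k, k, dimnames = list(colnames(dat), colnames(dat)))
  for (a in seq_len(k - 1)) {
    for (b in (a + 1):k) {
      tst = cor.test(dat[, a], dat[, b], conf.level = conf.level)
      p[a, b] = p[b, a] = tst$p.value
    }
  }
  p
}

p_demo = pairwise_p(mtcars)
round(p_demo[1:3, 1:3], 4)
```

The confidence bounds could be collected in the same loop from `tst$conf.int`.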

```{r test}
testRes = cor.mtest(mtcars, conf.level = 0.95)

## mark the insignificant coefficients according to the significance level
corrplot(M, p.mat = testRes$p, sig.level = 0.10, order = 'hclust', addrect = 2)


## leave blank on non-significant coefficient
## add significant correlation coefficients
corrplot(M, p.mat = testRes$p, method = 'circle', type = 'lower', insig='blank',
         addCoef.col ='black', number.cex = 0.8, order = 'AOE', diag=FALSE)
```

```{r special}
## leave blank on non-significant coefficient
## add all correlation coefficients
corrplot(M, p.mat = testRes$p, method = 'circle', type = 'lower', insig='blank',
         order = 'AOE', diag = FALSE)$corrPos -> p1
text(p1$x, p1$y, round(p1$corr, 2))
```
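
The chunk above relies on the `corrPos` element of the value returned by
`corrplot()`. A small sketch for inspecting it, with `res` as an illustrative
name:

```{r corr-pos}
library(corrplot)
res = corrplot(cor(mtcars), type = 'lower', diag = FALSE)

# plotting coordinates (x, y) and the coefficient (corr) of the cells
head(res$corrPos)
```

These coordinates are what makes it possible to add custom annotations with
base graphics functions such as `text()`.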

```{r p-values}
## add p-values on insignificant coefficients
corrplot(M, p.mat = testRes$p, insig = 'p-value')

## add all p-values
corrplot(M, p.mat = testRes$p, insig = 'p-value', sig.level = -1)

## add significant level stars
corrplot(M, p.mat = testRes$p, method = 'color', diag = FALSE, type = 'upper',
         sig.level = c(0.001, 0.01, 0.05), pch.cex = 0.9,
         insig = 'label_sig', pch.col = 'grey20', order = 'AOE')

## add significant level stars and cluster rectangles
corrplot(M, p.mat = testRes$p, tl.pos = 'd', order = 'hclust', addrect = 2,
         insig = 'label_sig', sig.level = c(0.001, 0.01, 0.05),
         pch.cex = 0.9, pch.col = 'grey20')
```

Visualize confidence interval.

```{r confidence-interval}
# Visualize confidence intervals
corrplot(M, lowCI.mat = testRes$lowCI, uppCI.mat = testRes$uppCI, order = 'hclust',
         tl.pos = 'd', rect.col = 'navy', plotCI = 'rect', cl.pos = 'n')

# Visualize confidence intervals and cross out the insignificant coefficients
corrplot(M, p.mat = testRes$p, lowCI.mat = testRes$lowCI, uppCI.mat = testRes$uppCI,
         addrect = 3, rect.col = 'navy', plotCI = 'rect', cl.pos = 'n')
```

## References

- Michael Friendly (2002). Corrgrams: Exploratory displays for correlation
matrices. The American Statistician, 56, 316--324.

- D.J. Murdoch, E.D. Chow (1996). A graphical display of large correlation
matrices. The American Statistician, 50, 178--180.

- Michael Hahsler, Christian Buchta and Kurt Hornik (2020). seriation: Infrastructure for Ordering
  Objects Using Seriation. R package version 1.2-9. https://CRAN.R-project.org/package=seriation

- Michael Hahsler, Kurt Hornik and Christian Buchta (2008). Getting things in order: An introduction
to the R package seriation. Journal of Statistical Software, 25(3), 1--34.
https://doi.org/10.18637/jss.v025.i03