'CAinterprTools': R package for visual aid to Correspondence Analysis interpretation
Some of the features of the R script for CA (described on this site) have been turned into an R package. The package is available from my GitHub repository and can be installed from there straight into R via the 'devtools' package (see the instructions at the bottom of this page). Besides implementing some of the features of my CA script, the package considerably expands the facilities available for visually aiding the interpretation of Correspondence Analysis results. Among other things, the package makes it possible to calculate the significance of the CA dimensions and of the total inertia by means of a permutation test. The package also comes with some built-in datasets, including one of those used by Prof. Michael Greenacre in his book on CA. The dataset is named 'greenacre_data' [after Greenacre 2007 (p. 90, exhibit 12.1)].
The package is also described in an article of mine published in Elsevier's SoftwareX journal.
If you want to cite my package, you may use the following format:
Gianmarco Alberti, CAinterprTools: An R package to help interpreting Correspondence Analysis’ results, SoftwareX, Volumes 1–2, September 2015, Pages 26–31, ISSN 2352-7110, http://dx.doi.org/10.1016/j.softx.2015.07.001.
Here is a list of the implemented commands, with short examples of their use (using the mentioned 'greenacre_data' dataset):
data("greenacre_data")
loads the sample dataset.
ca.corr(greenacre_data)
displays a bar plot of the strength of the correlation between rows and columns of the input contingency table.
sig.tot.inertia.perm(greenacre_data, k=10000)
calculates the significance of the CA total inertia via permutation test (using 10000 permutations); a density curve of the permuted total inertia is displayed along with the observed total inertia and the 95th percentile of the permuted total inertia. The latter can be regarded as a 0.05 alpha threshold for the observed total inertia's significance. The number of permutations can be set by the user (1000 is set by default).
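The logic of this test can be sketched in base R. What follows is only an illustration under stated assumptions, not the package's own code: it uses the Hair x Eye colour table from R's built-in HairEyeColor data as a stand-in for 'greenacre_data', it generates the permuted tables with r2dtable() (random tables with the observed row and column totals), and the names total_inertia, perm_inertia, and p_value are made up for the example.

```r
# Stand-in contingency table from R's built-in HairEyeColor data
# (assumption: used here instead of the package's greenacre_data).
tab <- margin.table(HairEyeColor, c(1, 2))

# Total inertia = Pearson chi-square statistic divided by the table total.
total_inertia <- function(m) {
  n <- sum(m)
  expected <- outer(rowSums(m), colSums(m)) / n
  sum((m - expected)^2 / expected) / n
}

set.seed(123)   # reproducible permutations
k <- 1000       # number of permuted tables

# r2dtable() draws random tables that keep the observed marginal totals,
# i.e. tables compatible with independence between rows and columns.
perm_tabs    <- r2dtable(k, rowSums(tab), colSums(tab))
perm_inertia <- vapply(perm_tabs, total_inertia, numeric(1))

obs_inertia <- total_inertia(tab)
threshold   <- quantile(perm_inertia, 0.95)   # 0.05 alpha threshold
p_value     <- (sum(perm_inertia >= obs_inertia) + 1) / (k + 1)
```

The observed total inertia is deemed significant when it exceeds threshold; plot(density(perm_inertia)) would give a density curve comparable to the one described above.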
aver.rule(greenacre_data)
returns a chart suggesting which CA dimensions are important for data structure interpretation, according to the so-called 'average rule'.
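A minimal base R sketch of the average rule follows. It assumes that the threshold is the average percentage of inertia per dimension (100 divided by the number of dimensions), and it uses the built-in HairEyeColor data as a stand-in table; the variable names are illustrative and the code is not taken from the package.

```r
# Stand-in contingency table (assumption: not the package's dataset).
tab <- margin.table(HairEyeColor, c(1, 2))

# Eigenvalues (principal inertias) from the SVD of the standardised residuals.
P  <- unclass(tab) / sum(tab)
r  <- rowSums(P)
cm <- colSums(P)
S  <- (P - outer(r, cm)) / sqrt(outer(r, cm))
n_dim <- min(dim(P)) - 1
eig   <- svd(S)$d[seq_len(n_dim)]^2

pct <- 100 * eig / sum(eig)     # percentage of inertia per dimension
avg <- 100 / n_dim              # average rule threshold (assumed definition)
important <- which(pct > avg)   # dimensions explaining more than the average
```

A bar plot of pct with a horizontal line at avg, e.g. barplot(pct) followed by abline(h = avg), gives a chart in the same spirit as the one returned by the function.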
malinvaud(greenacre_data)
performs the Malinvaud test and prints the test's results on screen (including the significance of the CA dimensions); a plot is also provided, wherein a reference line (in RED) indicates the 0.05 threshold.
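For readers who want to see what the test computes, here is a hedged base R sketch of the usual Malinvaud statistic: for k dimensions already retained, Q is the table total times the sum of the remaining eigenvalues, referred to a chi-square distribution with (I-k-1)(J-k-1) degrees of freedom. The stand-in table (HairEyeColor) and all variable names are assumptions made for the example; the package's output may be organised differently.

```r
# Stand-in contingency table (assumption: not the package's dataset).
tab <- margin.table(HairEyeColor, c(1, 2))
n   <- sum(tab)

# Eigenvalues from the SVD of the standardised residuals.
P  <- unclass(tab) / n
r  <- rowSums(P)
cm <- colSums(P)
S  <- (P - outer(r, cm)) / sqrt(outer(r, cm))
n_dim <- min(dim(P)) - 1
eig   <- svd(S)$d[seq_len(n_dim)]^2

I <- nrow(P)
J <- ncol(P)
k <- 0:(n_dim - 1)                 # number of dimensions already retained

# Q_k = n * (sum of the eigenvalues beyond dimension k)
Q  <- n * rev(cumsum(rev(eig)))
df <- (I - k - 1) * (J - k - 1)

malinvaud_tab <- data.frame(k = k, Q = Q, df = df,
                            p = pchisq(Q, df, lower.tail = FALSE))
malinvaud_tab
```

With k = 0 the statistic coincides with the ordinary chi-square test of independence; a p value above 0.05 at a given k suggests that the dimensions beyond k add no significant structure.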
sig.dim.perm.scree(greenacre_data, 1, k=1000)
calculates the significance of the CA dimensions via permutation test (using 1000 permutations), and displays the results as a scree plot. The black dots represent the observed eigenvalues, while the blue dots represent the 95th percentile of the distribution of the permuted eigenvalues. Dimensions whose black dots lie above the corresponding blue dots are significant at least at alpha 0.05.
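The same permutation idea can be applied dimension by dimension. The sketch below is an assumption-laden illustration rather than the package's implementation: it relies on r2dtable() to generate the permuted tables, uses the built-in HairEyeColor data as a stand-in, and the helper ca_eigen() is hypothetical.

```r
# Stand-in contingency table (assumption: not the package's dataset).
tab <- unclass(margin.table(HairEyeColor, c(1, 2)))

# Hypothetical helper: CA eigenvalues of a two-way table.
ca_eigen <- function(m) {
  P <- m / sum(m)
  e <- outer(rowSums(P), colSums(P))
  svd((P - e) / sqrt(e), nu = 0, nv = 0)$d[seq_len(min(dim(m)) - 1)]^2
}

set.seed(123)
k     <- 1000
n_dim <- min(dim(tab)) - 1

# One row per dimension, one column per permuted table.
perm <- vapply(r2dtable(k, rowSums(tab), colSums(tab)),
               ca_eigen, numeric(n_dim))

obs   <- ca_eigen(tab)                           # observed eigenvalues
p95   <- apply(perm, 1, quantile, probs = 0.95)  # 95th percentile per dimension
p_val <- (rowSums(perm >= obs) + 1) / (k + 1)    # permutation p values

data.frame(dim = seq_len(n_dim), observed = obs, perc95 = p95, p = p_val)
```

Plotting obs and p95 against the dimension number reproduces the layout of the scree plot described above.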
sig.dim.perm(greenacre_data, 1, 2, k=10000)
calculates the significance of the 1st and 2nd CA dimensions via permutation test (using 10000 permutations), and displays the results as a scatterplot; reference lines provide information about the significance of the selected dimensions. The number of permutations can be set by the user (1000 is set by default).
rows.cntr(greenacre_data, 1, cti=TRUE, sort=TRUE)
displays the contribution of the row categories to the 1st CA dimension; a reference line indicates the threshold above which a contribution can be considered important for the determination of the selected dimension. The parameter cti=TRUE specifies that the categories' contribution to the total inertia is also shown (hollow circle). The parameter sort=TRUE sorts the categories in descending order of contribution to the inertia of the selected dimension. On the left-hand side of the plot, the categories' labels are given a symbol (+ or -) according to whether each category contributes to the definition of the positive or negative side of the dimension, respectively. On the right-hand side, a legend reports the correlation of the column categories with the selected dimension (the 1st one in this example). A symbol (+ or -) indicates with which side of the selected dimension each column category is correlated. As far as interpretation is concerned, Zoology, for instance, is the major contributor to the definition of the negative side of the 1st dimension, and category D has a very high correlation with the negative side of the 1st dimension.
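To make the quantities behind this chart concrete, here is a small base R sketch of how row contributions to a dimension can be computed. It assumes that the reference line corresponds to the average contribution (100 divided by the number of rows), it uses the built-in HairEyeColor data as a stand-in, and the variable names are illustrative; it is not the package's code.

```r
# Stand-in contingency table (assumption: not the package's dataset).
tab <- margin.table(HairEyeColor, c(1, 2))
P   <- unclass(tab) / sum(tab)
r   <- rowSums(P)    # row masses
cm  <- colSums(P)    # column masses
sv  <- svd((P - outer(r, cm)) / sqrt(outer(r, cm)))

dimension <- 1

# Row principal coordinates on the chosen dimension: f = u * d / sqrt(mass)
f <- sv$u[, dimension] * sv$d[dimension] / sqrt(r)

# Contribution (in %) of each row: mass * f^2 / eigenvalue
cntr <- 100 * r * f^2 / sv$d[dimension]^2

threshold <- 100 / nrow(P)   # average contribution (assumed reference line)

out <- data.frame(category = paste0(names(r), ifelse(f >= 0, " (+)", " (-)")),
                  cntr     = round(cntr, 1),
                  major    = cntr > threshold)
out[order(-out$cntr), ]      # sorted as with sort=TRUE
```

Note that the sign of an SVD axis is arbitrary, so the + and - labels only have meaning relative to the orientation of the map being interpreted.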
rows.cntr.scatter(greenacre_data, 1, 2)
displays a scatterplot of the row categories' contributions to dimensions 1 and 2; reference lines indicate the threshold above which a contribution can be considered important. A diagonal line (in BLACK) is a visual aid to eyeball whether a category contributes more (in relative terms) to either of the two dimensions. The row categories' labels are coupled with two symbols (each either + or -) within round brackets, indicating to which side of the two selected dimensions the contribution values read off the chart refer. The first symbol (i.e., the one to the left) refers to the first of the selected dimensions (i.e., the one reported on the x-axis). The second symbol (i.e., the one to the right) refers to the second of the selected dimensions (i.e., the one reported on the y-axis). In the example below, Zoology is a major contributor to the definition of the negative side of the 1st dimension (note: it is beyond the vertical dashed red line AND has a "-" symbol in first position within brackets); Zoology's contribution to the positive side of the 2nd dimension (note the "+" symbol in second position within brackets) is not important (note: the point is below the horizontal dashed red line indicating the threshold for an important contribution to the definition of the 2nd dimension).
rows.qlt(greenacre_data, 1, 2, sort=TRUE)
displays the quality of the row categories' display on the subspace determined by the 1st and 2nd CA dimensions; the parameter sort=TRUE sorts the categories in decreasing order of quality.
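Quality is usually defined as the sum, over the selected dimensions, of the squared correlations (squared cosines) between a category and each dimension. The base R sketch below illustrates that definition under the same assumptions as before (HairEyeColor as a stand-in table, illustrative variable names, not the package's code).

```r
# Stand-in contingency table (assumption: not the package's dataset).
tab <- margin.table(HairEyeColor, c(1, 2))
P   <- unclass(tab) / sum(tab)
r   <- rowSums(P)
cm  <- colSums(P)
sv  <- svd((P - outer(r, cm)) / sqrt(outer(r, cm)))
n_dim <- min(dim(P)) - 1

# Row principal coordinates on all non-trivial dimensions.
Fm <- (sv$u[, 1:n_dim] / sqrt(r)) %*% diag(sv$d[1:n_dim])
rownames(Fm) <- rownames(P)

# Squared correlation of each row with each dimension: the share of the
# row's squared distance from the centroid accounted for by that dimension.
cos2 <- Fm^2 / rowSums(Fm^2)

# Quality of the display on the plane of dimensions 1 and 2, sorted.
quality_12 <- sort(rowSums(cos2[, 1:2]), decreasing = TRUE)
quality_12
```

A single column of cos2 gives the correlation of the rows with one dimension, which is the quantity plotted by rows.corr().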
rows.corr(greenacre_data, 1, sort=TRUE)
displays the correlation of the row categories with the 1st CA dimension; the parameter sort=TRUE arranges the categories in decreasing order of correlation. On the left-hand side, the categories' labels show a symbol (+ or -) according to the side of the selected dimension, either positive or negative, with which they are correlated. On the right-hand side, a legend indicates the column categories' contribution to the selected dimension (value enclosed within round brackets), and a symbol (+ or -) indicating whether they contribute to the definition of the positive or negative side of the dimension, respectively. Further, an asterisk (*) flags the categories which can be considered major contributors to the definition of the dimension. For instance, Physics has a high correlation with the positive side of the first dimension, which is actually defined by the column category A (which is one of the two major contributors to the definition of the 1st dimension).
rows.corr.scatter(greenacre_data, 1, 2)
displays a scatterplot of the row categories' correlations with dimensions 1 and 2. A diagonal line (in BLACK) is a visual aid to eyeball whether a category is more correlated (in relative terms) with either of the two dimensions. The row categories' labels are coupled with two symbols (each either + or -) within round brackets, indicating to which side of the two selected dimensions the correlation values read off the chart refer. The first symbol (i.e., the one to the left) refers to the first of the selected dimensions (i.e., the one reported on the x-axis). The second symbol (i.e., the one to the right) refers to the second of the selected dimensions (i.e., the one reported on the y-axis). In the example below, Zoology, for instance, has a high correlation with the negative side of the 1st dimension (the latter being indicated by the first symbol of the two between brackets), while it has a smaller correlation with the positive side of the 2nd dimension (the latter being indicated by the "+" symbol in second position between brackets).
Needless to say, the above commands (which regard row categories) have column counterparts:
cols.cntr(greenacre_data, 1, cti=TRUE, sort=FALSE)
cols.cntr.scatter(greenacre_data, 1, 2)
cols.qlt(greenacre_data, 1, 2, sort=TRUE)
cols.corr(greenacre_data, 1, sort=TRUE)
cols.corr.scatter(greenacre_data, 1, 2)
Version history
As of version 0.5, 'CAinterprTools' integrates two functions that are described elsewhere on this same site, as well as a brand new third one:
1) ca.scatter(): described at this page on this same site
2) ca.plus(): described at this page on this same site
3) sig.dim.perm.scree(): tests the significance of the CA dimensions by permuting the input contingency table. The number of permutations is set by the user. The function returns a scree plot displaying, for each dimension, the observed eigenvalue and the 95th percentile of the permuted distribution of the corresponding eigenvalue. Observed eigenvalues that are larger than the corresponding 95th percentile are significant at least at alpha 0.05. See the command's help provided by the package for further details. As of version 0.9 (see below), p values are displayed directly in the chart.
New in version 0.6: the 'ggplot2' and 'ggrepel' packages are used to produce the charts returned by the functions cols.cntr.scatter(), rows.cntr.scatter(), cols.corr.scatter(), and rows.corr.scatter(). The two packages have been preferred over R's base plotting facility for their ability to plot non-overlapping point labels. This allows complex charts to have less cluttered labels.
New in version 0.7: ca.percept() has been added to the package; the function is described at this page on this same site. The brand_coffee dataset has also been included. The dataset is after Kennedy et al., Practical Applications of Correspondence Analysis to Categorical Data in Market Research, in Journal of Targeting Measurement and Analysis for Marketing, 1996. Minor corrections have been made to the help documentation of a handful of commands.
New in version 0.8: the facility has been added to the rows.cntr() and cols.cntr() functions to sort the categories in descending order of contribution to the inertia of the selected dimension. Minor corrections have been made to the help documentation of a handful of commands.
New in version 0.9: the facility has been added to the sig.dim.perm.scree() function to display p values directly in the chart.
New in version 0.10: the facility has been added to the rows.corr(), cols.corr(), rows.qlt(), and cols.qlt() functions to sort the categories in descending order of correlation with the selected dimension and of quality of the representation on the subspace defined by the selected pair of dimensions. Minor corrections have been made to the help documentation of a handful of commands.
New in version 0.11: the functions rows.cntr(), cols.cntr(), rows.corr(), and cols.corr() have been improved; symbols have been added to the dotplot's labels indicating to which side of the selected dimension the row/column categories are contributing (for the rows.cntr() and cols.cntr() functions) or with which side of the selected dimension the categories are correlated (for the rows.corr() and cols.corr() functions). A legend has been added containing information crucial to the interpretation of the CA results.
New in version 0.12: the functions rows.cntr.scatter() and cols.cntr.scatter() have been improved by adding more informative labels to the categories' points.
New in version 0.13: the functions ca.plot() and ca.cluster(), and the 'deseases' dataset, have been added. The latter is from Velleman and Hoaglin, "Applications, Basics, and Computing of Exploratory Data Analysis", Wadsworth Pub Co (1984), Exhibit 81.
'CAinterprTools' package installation
To install the package in R, just follow the few steps listed below (you can copy and paste the pieces of code):
1) install the 'devtools' package:
install.packages("devtools", dependencies=TRUE)
2) load that package:
library(devtools)
3) download the 'CAinterprTools' package from GitHub via the 'devtools' command:
install_github("gianmarcoalberti/CAinterprTools")
4) load the package:
library(CAinterprTools)
5) enjoy!
