canprot {JMDplots}

R Documentation

Functions for processing differential expression datasets

Description

These functions were moved from the canprot package in February 2024.

Usage

  mkvig(vig = NULL)
  check_IDs(dat, IDcol, aa_file = NULL, updates_file = NULL)
  cleanup(dat, IDcol, up2 = NULL)
  get_comptab(pdat, var1 = "Zc", var2 = "nH2O", plot.it = FALSE,
    mfun = "median", oldstyle = FALSE)
  diffplot(comptab, vars = c("Zc", "nH2O"), col = "black", plot.rect = FALSE,
           pt.text = c(letters, LETTERS), cex.text = 0.85, oldstyle = FALSE,
           pch = 1, cex = 2.1, contour = TRUE, col.contour = par("fg"),
           probs = 0.5, add = FALSE, labtext = NULL, ...)
  qdist(pdat, vars = c("Zc", "nH2O"), show.steps = FALSE)
  xsummary(comptab, vars = c("Zc", "nH2O"))
  xsummary2(comptab1, comptab2)
  xsummary3(comptab1, comptab2, comptab3)

Arguments

`vig`	character, name of a vignette without ‘⁠.Rmd⁠’ extension
`dat`	data frame, protein expression data
`IDcol`	character, name of column that has the UniProt IDs
`aa_file`	character, name of file with additional amino acid compositions
`updates_file`	character, name of file with old to new ID mappings
`up2`	logical, TRUE for up-regulated proteins, FALSE for down-regulated proteins
`pdat`	list, data object generated by a `pdat_` function
`var1`	character, the first variable
`var2`	character, the second variable
`plot.it`	logical, make a scatterplot?
`mfun`	character, either ‘⁠median⁠’ or ‘⁠mean⁠’
`oldstyle`	logical, also calculate `CLES` and p-values?
`comptab`	list or data frame, output of `get_comptab`
`vars`	character, which variables (chemical metrics) to calculate or plot
`col`	character or numeric, color(s) for the points
`plot.rect`	logical, plot a reference rectangle?
`pt.text`	character, text labels for the points
`cex.text`	numeric, size of text labels
`pch`	numeric, point symbol
`cex`	numeric, point size
`contour`	logical, add contour lines?
`col.contour`	character or numeric, color of contour lines
`probs`	numeric, probability level(s) for contours
`add`	logical, add to an existing plot?
`labtext`	character, text to add to axis labels
`...`	other arguments passed to `plot`
`show.steps`	logical, show the steps using `plot.ecdf`?
`comptab1`	list, output of `get_comptab`
`comptab2`	list, output of `get_comptab`
`comptab3`	list, output of `get_comptab`

Details

These functions for processing differential expression datasets for the gradH2O and canH2O papers were previously in the canprot package.

mkvig compiles the indicated vignette for chemical analysis of differential expression datasets and opens it in the browser. Pandoc (including pandoc-citeproc), as a system dependency of rmarkdown, must be installed. See rmarkdown's ‘⁠pandoc⁠’ vignette for installation tips. The vignettes can also be run using e.g. demo("HPA"), and through the interactive help system (help.start > Packages > JMDplots > Code demos). The available vignettes are listed here:

Cell culture – ‘⁠hypoxia⁠’, ‘⁠secreted⁠’, ‘⁠osmotic_bact⁠’, ‘⁠osmotic_euk⁠’, ‘⁠osmotic_halo⁠’, ‘⁠glucose⁠’, ‘⁠3D⁠’, ‘⁠osmotic_gene⁠’, ‘⁠yeast_stress⁠’
Cancer – ‘⁠breast⁠’, ‘⁠colorectal⁠’, ‘⁠liver⁠’, ‘⁠lung⁠’, ‘⁠pancreatic⁠’, ‘⁠prostate⁠’
Pan-cancer – ‘⁠TCGA⁠’, ‘⁠HPA⁠’

check_IDs is used to check for known UniProt IDs and to update obsolete IDs. The source IDs should be provided in the IDcol column of dat; multiple IDs for one protein can be separated by a semicolon. The function keeps the first “known” ID for each protein, which must be present in one of these groups:

The human.aa dataset of amino acid compositions.
Old UniProt IDs that are mapped to new UniProt IDs in ‘extdata/diffexpr/uniprot_updates.csv’ or in updates_file if specified.
IDs of proteins in aa_file, which lists amino acid compositions in the format described for human.aa (see thermo$protein for details).
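
The selection of the first known ID can be sketched in base R as follows. This is an illustration only, not the code used by check_IDs: the function name first_known_id and the vector known are hypothetical, and known stands in for the combined IDs from human.aa, the updates file, and aa_file. Mapping of old IDs to new IDs is not shown.

```r
# Hypothetical set of "known" IDs (an assumption for this illustration)
known <- c("P61247", "P46777", "P60174")

# Return the first known ID among semicolon-separated candidates, or NA
first_known_id <- function(ids, known) {
  vapply(strsplit(ids, ";", fixed = TRUE), function(candidates) {
    hit <- candidates[candidates %in% known]
    if (length(hit) > 0) hit[1] else NA_character_
  }, character(1))
}

ID <- c("P61247;PXXXXX", "PYYYYY;P46777;P60174", "PZZZZZ")
result <- first_known_id(ID, known)
print(result)
```

In this sketch the third protein has no known ID, so it is assigned NA.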

cleanup removes proteins with unavailable IDs, ambiguous expression ratios, and duplicated IDs. IDcol is the name of the column that has the UniProt IDs, and up2 indicates the expression change for each protein. The function removes proteins with unavailable (NA or "") or duplicated IDs. If up2 is provided, the function also removes unquantified proteins (those that have NA values of up2) and those with ambiguous expression ratios (up and down for the same ID). For each operation, a message is printed describing the number of proteins that are ‘⁠unavailable⁠’, ‘⁠unquantified⁠’, ‘⁠ambiguous⁠’, or ‘⁠duplicated⁠’. Alternatively, if IDcol is a logical value, it selects proteins to be unconditionally removed.
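
The sequence of removal steps might look like the following base R sketch. It is not the implementation of cleanup: the function name cleanup_sketch and the input values are hypothetical, the order of the steps is assumed, and keeping the first occurrence of a duplicated ID is also an assumption.

```r
# Hypothetical input: an empty ID, a missing ID, a duplicated ID ("P2"),
# an ID reported as both up and down ("P3"), and an unquantified protein ("P4")
dat <- data.frame(
  Entry = c("P1", "", NA, "P2", "P2", "P3", "P3", "P4"),
  stringsAsFactors = FALSE
)
up2 <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, NA)

cleanup_sketch <- function(dat, IDcol, up2 = NULL) {
  # Step 1: unavailable IDs (NA or "")
  drop <- is.na(dat[[IDcol]]) | dat[[IDcol]] == ""
  message("unavailable: ", sum(drop))
  dat <- dat[!drop, , drop = FALSE]
  if (!is.null(up2)) {
    up2 <- up2[!drop]
    # Step 2: unquantified proteins (NA values of up2)
    drop <- is.na(up2)
    message("unquantified: ", sum(drop))
    dat <- dat[!drop, , drop = FALSE]
    up2 <- up2[!drop]
    # Step 3: ambiguous IDs (both up and down for the same ID)
    n_dir <- tapply(up2, dat[[IDcol]], function(x) length(unique(x)))
    drop <- dat[[IDcol]] %in% names(n_dir)[n_dir > 1]
    message("ambiguous: ", sum(drop))
    dat <- dat[!drop, , drop = FALSE]
  }
  # Step 4: duplicated IDs (assumption: the first occurrence is kept)
  drop <- duplicated(dat[[IDcol]])
  message("duplicated: ", sum(drop))
  dat[!drop, , drop = FALSE]
}

cleaned <- cleanup_sketch(dat, "Entry", up2)
```

With this input, only the rows for "P1" and the first "P2" remain.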

get_comptab computes differences of chemical metrics between groups of up- and down-regulated proteins.

Differentially expressed proteins are identified by the value of pdat$up2 (TRUE for up-regulated proteins and FALSE for down-regulated proteins).
The differences are calculated as (median for up-regulated proteins) - (median for down-regulated proteins). If mfun is ‘⁠mean⁠’, means of the groups are used instead.
If oldstyle is TRUE, the function also calculates the common language effect size (CLES, in percent) and p-value for each variable.
Volume is calculated using amino acid group additivity as described by Dick et al. (2006).
Set plot.it to TRUE to make a scatterplot. Open red squares and filled blue circles stand for up-regulated and down-regulated proteins, respectively.
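
The difference calculation can be illustrated with a few lines of base R. The values and the function names metric_difference and cles_percent are hypothetical. The CLES definition used here (the percentage of up/down pairs in which the up-regulated value is larger) is an assumption about what get_comptab computes, and the p-value calculation is not shown.

```r
# Hypothetical Zc values for the two groups (not taken from any dataset)
Zc_down <- c(-0.20, -0.15, -0.12, -0.10, -0.05)
Zc_up   <- c(-0.14, -0.08, -0.06, -0.02)

# (mfun for up-regulated proteins) - (mfun for down-regulated proteins)
metric_difference <- function(up, down, mfun = "median") {
  f <- match.fun(mfun)
  f(up) - f(down)
}

# Assumed CLES definition: percentage of all (up, down) pairs with up > down
cles_percent <- function(up, down) {
  100 * mean(outer(up, down, ">"))
}

diff_median <- metric_difference(Zc_up, Zc_down)
diff_mean   <- metric_difference(Zc_up, Zc_down, mfun = "mean")
cles        <- cles_percent(Zc_up, Zc_down)
```

For these values the median difference is 0.05, the mean difference is 0.049, and the CLES is 75 percent.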

diffplot makes a plot with points showing the differences between up- and down-regulated proteins for two chemical metrics, as calculated by get_comptab.

The default setting of vars refers to average oxidation state of carbon (Z_C) as the x-variable and stoichiometric hydration state (n_H₂O) as the y-variable.
The colors of the points are controlled by col, which is recycled to match the number of comparisons in comptab.
If plot.rect is TRUE, a shaded rectangle is drawn with coordinates -0.01, -0.01, 0.01, 0.01. This is useful for visualizing the different scales of multi-panel plots.
If pt.text is not NA or FALSE, text labels are added with size controlled by cex.text. The default value produces labels that are taken sequentially from the 26 lowercase Roman letters in alphabetical order (letters), followed by the set of uppercase letters (LETTERS).
For labtext = NULL, descriptive text (“median difference” or “mean difference”) is added to the axis labels in parentheses. This text can be changed by giving a value in labtext (for both axes), two values (for each axis), or NA to suppress the text.
cplab is a list of formatted labels used by diffplot. It is an exported object, available to the user and other packages.

qdist makes a quantile distribution plot with lines for both up- and down-regulated proteins. The variable (vars) can be ‘Zc’, ‘nH2O’, or both (two plots are made for the latter). The horizontal axis is the variable and the vertical axis is the quantile point. A solid black line is drawn for the down-regulated proteins, and a dashed red line for the up-regulated proteins. The median difference is shown by a gray horizontal line drawn between the distributions at the 0.5 quantile point.
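
A plot of this kind can be sketched in base R. The function names quantile_points and draw_qdist_sketch and the input values are hypothetical, and taking the quantile point of the i-th smallest of n values as i / n (the value of the empirical cumulative distribution function) is an assumption about how qdist computes them.

```r
# Hypothetical Zc values for the two groups (not taken from any dataset)
Zc_down <- c(-0.20, -0.15, -0.12, -0.10, -0.05)
Zc_up   <- c(-0.14, -0.08, -0.06, -0.02)

# Sorted values paired with quantile points i / n (assumed definition)
quantile_points <- function(x) {
  x <- sort(x)
  data.frame(value = x, quantile = seq_along(x) / length(x))
}

qd_down <- quantile_points(Zc_down)
qd_up   <- quantile_points(Zc_up)
median_shift <- median(Zc_up) - median(Zc_down)

# Draw both distributions and mark the median difference
draw_qdist_sketch <- function(down, up, xlab = "Zc") {
  plot(down$value, down$quantile, type = "l", col = "black",
       xlim = range(down$value, up$value), ylim = c(0, 1),
       xlab = xlab, ylab = "Quantile point")
  lines(up$value, up$quantile, col = "red", lty = 2)
  # Gray horizontal segment between the two medians at the 0.5 quantile point
  lines(c(median(down$value), median(up$value)), c(0.5, 0.5), col = "gray")
}
```

Calling draw_qdist_sketch(qd_down, qd_up) draws the plot on the current graphics device.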

xsummary makes an HTML table summarizing chemical differences using xtable. Bold and underline formatting is used to highlight significant chemical differences. The p-value is bolded if it is less than 0.05, and the percent common language effect size (CLES) is bolded if it is <= 40 or >= 60. The mean (or median) difference is underlined if only one of the p-value and CLES passes its cutoff, and bolded if both do. The generated table is written to the console, and can be used in a vignette using the results = "asis" chunk option.
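
The highlighting rules can be expressed as a small base R function. This is a sketch of the rules as stated here, not the code in xsummary: the function name highlight is hypothetical, and the HTML tags used for bold and underline are assumptions.

```r
# Apply the cutoffs to one row: p-value < 0.05, CLES <= 40 or >= 60
highlight <- function(diff, p.value, CLES) {
  bold <- function(x) paste0("<b>", x, "</b>")       # assumed HTML markup
  underline <- function(x) paste0("<u>", x, "</u>")  # assumed HTML markup
  p_pass <- p.value < 0.05
  cles_pass <- CLES <= 40 || CLES >= 60
  n_pass <- sum(p_pass, cles_pass)
  c(
    diff = if (n_pass == 2) {
      bold(diff)
    } else if (n_pass == 1) {
      underline(diff)
    } else {
      as.character(diff)
    },
    CLES = if (cles_pass) bold(CLES) else as.character(CLES),
    p.value = if (p_pass) bold(p.value) else as.character(p.value)
  )
}

both <- highlight(diff = 0.05, p.value = 0.01, CLES = 75)
one  <- highlight(diff = 0.02, p.value = 0.2, CLES = 62)
none <- highlight(diff = 0.001, p.value = 0.6, CLES = 52)
```

Here both has a bolded difference, one has an underlined difference with a bolded CLES, and none is left unformatted.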

xsummary2 is an updated version that is used in the current vignettes in the package. It shows negative numbers in bold (p-value and CLES are not shown). xsummary3 is a further revision that shows GRAVY and pI; it is used in the ‘⁠osmotic_bact⁠’ and ‘⁠osmotic_halo⁠’ vignettes.

Value

For check_IDs, dat is returned with possibly changed values in the column designated by IDcol; old IDs are replaced with new ones, the first known ID for each protein is kept, then proteins with no known IDs are assigned NA.

For get_comptab, a data frame is returned invisibly containing the columns ‘dataset’, ‘description’, ‘n1’ (number of down-regulated proteins), ‘n2’ (number of up-regulated proteins), followed by two sets of columns for the variables. These are denoted generically as (‘var.mfun1’, ‘var.mfun2’, ‘var.diff’, ‘var.CLES’, ‘var.p.value’), where ‘var’ is replaced by the name of var1 or var2, and ‘mfun’ is replaced by the value of mfun. For example, ‘Zc.median1’ and ‘Zc.median2’ are the median Z_C of the down- and up-regulated proteins, respectively.
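
The naming pattern for the columns can be reproduced with paste. The helper comptab_colnames is hypothetical, and including the ‘CLES’ and ‘p.value’ columns only when oldstyle is TRUE is an assumption based on the description of that argument.

```r
# Build the expected column names from var1, var2, and mfun
comptab_colnames <- function(var1 = "Zc", var2 = "nH2O", mfun = "median",
                             oldstyle = FALSE) {
  suffixes <- c(paste0(mfun, 1:2), "diff")
  # Assumption: CLES and p-value columns are present only for oldstyle = TRUE
  if (oldstyle) suffixes <- c(suffixes, "CLES", "p.value")
  c("dataset", "description", "n1", "n2",
    paste(var1, suffixes, sep = "."),
    paste(var2, suffixes, sep = "."))
}

default_names <- comptab_colnames()
oldstyle_names <- comptab_colnames(mfun = "mean", oldstyle = TRUE)
```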

For xsummary, the data frame used to make the table is returned invisibly; this data frame differs from comptab by having row names added (alphabetical one-letter IDs for the datasets).

Plot style

The overall style of the plot is controlled by oldstyle.

oldstyle = FALSE

This is the current default style. Use pch and cex to control the point symbol and size. Contours are added for confidence regions of highest probability density, computed using a 2-D kernel density estimate (kde2d). probs gives the probability level(s) and col.contour sets the color(s) of the contour lines. contour can be a logical vector, indicating which points to include; set it to FALSE to omit the contour lines.

The code to calculate the contour levels is modified from HPDregionplot in the emdbook package by Ben Bolker (https://cran.r-project.org/package=emdbook).
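
The idea behind the contour levels can be sketched as follows, using kde2d from the MASS package. This is a simplified illustration in the spirit of the approach named above, not the code used by diffplot: the function name hpd_levels and the random input data are hypothetical, and the default bandwidth and grid limits of kde2d are assumed.

```r
library(MASS)  # provides kde2d

set.seed(42)
# Hypothetical differences for 50 comparisons (random numbers for illustration)
x <- rnorm(50, sd = 0.02)
y <- rnorm(50, sd = 0.02)

hpd_levels <- function(x, y, probs = 0.5, n = 50) {
  dens <- kde2d(x, y, n = n)
  # Area of one grid cell
  dx <- diff(dens$x[1:2])
  dy <- diff(dens$y[1:2])
  # Probability mass accumulated from the lowest density upward
  sz <- sort(dens$z)
  c1 <- cumsum(sz) * dx * dy
  # Density value below which the accumulated mass is 1 - prob
  levels <- vapply(probs, function(p) {
    approx(c1, sz, xout = 1 - p, ties = mean)$y
  }, numeric(1))
  list(dens = dens, levels = levels)
}

hpd <- hpd_levels(x, y, probs = 0.5)
```

The result could then be drawn on an existing plot with contour(hpd$dens, levels = hpd$levels, drawlabels = FALSE, add = TRUE).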

oldstyle = TRUE

This style was used for the historical (2017) vignettes, which have been moved to the ‘extdata/cpcp’ directory in JMDplots (https://github.com/jedick/JMDplots). For each dataset, the point symbol is a filled square if the p-values of both the x-variable and y-variable are less than 0.05, a filled circle if the p-value of one of the x- or y-variables is less than 0.05, and an open circle otherwise. A solid line is drawn from the point to the corresponding axis if the rounded, absolute value of (CLES in percent - 50) of the x- or y-variable is greater than or equal to 10. Otherwise, a dashed line is drawn from the point to the corresponding axis if the p-value of the x- or y-variable is less than 0.05. Otherwise, no line is drawn.
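
These rules can be written as two small base R functions. The names oldstyle_symbol and oldstyle_line are hypothetical, and the pch codes (15, 19, and 1) are assumptions chosen to match the described symbols; the codes actually used by diffplot may differ.

```r
# Point symbol from the p-values of the x- and y-variables
oldstyle_symbol <- function(p.x, p.y) {
  n_sig <- sum(p.x < 0.05, p.y < 0.05)
  # Assumed pch codes: 1 = open circle, 19 = filled circle, 15 = filled square
  c(1, 19, 15)[n_sig + 1]
}

# Line drawn from the point to one axis, from that variable's CLES and p-value
oldstyle_line <- function(CLES, p.value) {
  if (round(abs(CLES - 50)) >= 10) {
    "solid"
  } else if (p.value < 0.05) {
    "dashed"
  } else {
    "none"
  }
}

symbols <- c(oldstyle_symbol(0.01, 0.02),
             oldstyle_symbol(0.01, 0.50),
             oldstyle_symbol(0.50, 0.50))
line_types <- c(oldstyle_line(59.6, 0.50),
                oldstyle_line(55, 0.01),
                oldstyle_line(55, 0.50))
```

Because the absolute difference is rounded before the comparison, a CLES of 59.6 percent (difference of 9.6, rounded to 10) gives a solid line in this sketch.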

References

Dick, J. M., LaRowe, D. E. and Helgeson, H. C. (2006) Temperature, pressure, and electrochemical constraints on protein speciation: Group additivity calculation of the standard molal thermodynamic properties of ionized unfolded proteins. Biogeosciences 3, 311–336. doi:10.5194/bg-3-311-2006

Jimenez, C. R., Knol, J. C., Meijer, G. A. and Fijneman, R. J. A. (2010) Proteomics of colorectal cancer: Overview of discovery studies and identification of commonly identified cancer-associated proteins and candidate CRC serum markers. J. Proteomics 73, 1873–1895. doi:10.1016/j.jprot.2010.06.004

Examples

## Not run: 
mkvig("osmotic_gene")

## End(Not run)

# Synthetic data to show actions for incorrect IDs
ID <- c("P61247;PXXXXX", "PYYYYY;P46777;P60174", "PZZZZZ")
dat <- data.frame(ID = ID, stringsAsFactors = FALSE)
# Get the first known ID for each protein; the third one is NA
check_IDs(dat, "ID")
# Update an old ID
dat <- data.frame(Entry = "P50224", stringsAsFactors = FALSE)
check_IDs(dat, "Entry")

# Set up a workflow to clean data and retrieve amino acid compositions
extdatadir <- system.file("extdata", package = "JMDplots")
datadir <- paste0(extdatadir, "/diffexpr/pancreatic/")
dataset <- "CYD+05"
dat <- read.csv(paste0(datadir, dataset, ".csv.xz"), as.is = TRUE)
up2 <- dat$Ratio..cancer.normal. > 1
# Remove two unavailable and one duplicated proteins
dat <- cleanup(dat, "Entry", up2)
# Now we can retrieve the amino acid compositions
aa <- canprot::human_aa(dat$Entry)

# Read another data file
datadir <- paste0(system.file("extdata", package = "JMDplots"), "/diffexpr/colorectal/")
dataset <- "STK+15"
dat <- read.csv(paste0(datadir, dataset, ".csv.xz"), as.is = TRUE)
# Remove unavailable proteins
dat <- cleanup(dat, "uniprot")
# Remove proteins that have less than 2-fold expression ratio
dat <- cleanup(dat, abs(log2(dat$invratio)) < 1)

# Analysis of differential expression data
pd <- pdat_colorectal("JKMF10")
# Default variables: Zc and nH2O
get_comptab(pd, plot.it = TRUE)
# Protein length and per-residue volume
get_comptab(pd, "nAA", "V0", plot.it = TRUE)

# Make an old-style plot for two datasets
comptab <- lapply(c("JKMF10", "WDO+15_C.N"), function(dataset) {
  pdat <- pdat_colorectal(dataset)
  get_comptab(pdat, oldstyle = TRUE)
})
diffplot(comptab, oldstyle = TRUE)

# Plot the data of Jimenez et al., 2010 for colorectal cancer
pdat <- pdat_colorectal("JKMF10")
qdist(pdat)

# Making a table
comptab <- lapply(c("JKMF10", "WDO+15_C.N"), function(dataset) {
  pdat <- pdat_colorectal(dataset)
  get_comptab(pdat, oldstyle = TRUE)
})
xsummary(comptab)

[Package JMDplots version 1.2.19-14 Index]