canprot {JMDplots}    R Documentation

Functions for processing differential expression datasets

Description

These functions were moved from the canprot package in February 2024.

Usage

  mkvig(vig = NULL)
  check_IDs(dat, IDcol, aa_file = NULL, updates_file = NULL)
  cleanup(dat, IDcol, up2 = NULL)
  get_comptab(pdat, var1 = "Zc", var2 = "nH2O", plot.it = FALSE,
    mfun = "median", oldstyle = FALSE)
  diffplot(comptab, vars = c("Zc", "nH2O"), col = "black", plot.rect = FALSE,
           pt.text = c(letters, LETTERS), cex.text = 0.85, oldstyle = FALSE,
           pch = 1, cex = 2.1, contour = TRUE, col.contour = par("fg"),
           probs = 0.5, add = FALSE, labtext = NULL, ...)
  qdist(pdat, vars = c("Zc", "nH2O"), show.steps = FALSE)
  xsummary(comptab, vars = c("Zc", "nH2O"))
  xsummary2(comptab1, comptab2)
  xsummary3(comptab1, comptab2, comptab3)

Arguments

vig

character, name of a vignette without ‘⁠.Rmd⁠’ extension

dat

data frame, protein expression data

IDcol

character, name of column that has the UniProt IDs

aa_file

character, name of file with additional amino acid compositions

updates_file

character, name of file with old to new ID mappings

up2

logical, TRUE for up-regulated proteins, FALSE for down-regulated proteins

pdat

list, data object generated by a pdat_ function

var1

character, the first variable

var2

character, the second variable

plot.it

logical, make a scatterplot?

mfun

character, either ‘median’ or ‘mean’

oldstyle

logical, also calculate CLES and p-values?

comptab

list or data frame, output of get_comptab

vars

character, which variables (chemical metrics) to calculate or plot

col

character or numeric, color(s) for the points

plot.rect

logical, plot a reference rectangle?

pt.text

character, text labels for the points

cex.text

numeric, size of text labels

pch

numeric, point symbol

cex

numeric, point size

contour

logical, add contour lines?

col.contour

character or numeric, color of contour lines

probs

numeric, probability level(s) for contours

add

logical, add to an existing plot?

labtext

character, text to add to axis labels

...

other arguments passed to plot

show.steps

logical, show the steps using plot.ecdf?

comptab1

list, output of get_comptab

comptab2

list, output of get_comptab

comptab3

list, output of get_comptab

Details

These functions for processing differential expression datasets for the gradH2O and canH2O papers were previously in the canprot package.

mkvig compiles the indicated vignette for chemical analysis of differential expression datasets and opens it in the browser. Pandoc (including pandoc-citeproc), as a system dependency of rmarkdown, must be installed. See rmarkdown's ‘⁠pandoc⁠’ vignette for installation tips. The vignettes can also be run using e.g. demo("HPA"), and through the interactive help system (help.start > Packages > JMDplots > Code demos). The available vignettes are listed here:

check_IDs is used to check for known UniProt IDs and to update obsolete IDs. The source IDs should be provided in the IDcol column of dat; multiple IDs for one protein can be separated by a semicolon. The function keeps the first “known” ID for each protein, which must be present in one of these groups:

cleanup removes proteins with unavailable IDs, ambiguous expression ratios, and duplicated IDs. IDcol is the name of the column that has the UniProt IDs, and up2 indicates the expression change for each protein. The function removes proteins with unavailable (NA or "") or duplicated IDs. If up2 is provided, the function also removes unquantified proteins (those that have NA values of up2) and those with ambiguous expression ratios (up and down for the same ID). For each operation, a message is printed describing the number of proteins that are ‘⁠unavailable⁠’, ‘⁠unquantified⁠’, ‘⁠ambiguous⁠’, or ‘⁠duplicated⁠’. Alternatively, if IDcol is a logical value, it selects proteins to be unconditionally removed.
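The filtering order described above can be sketched in base R. This is a minimal illustration, not the package code: cleanup_sketch and the toy data are hypothetical, and the order of the steps (unavailable, unquantified, ambiguous, duplicated) is an assumption based on the description.

```r
# Hypothetical stand-in for the filtering steps of cleanup (not the package code)
cleanup_sketch <- function(dat, IDcol, up2 = NULL) {
  ID <- dat[[IDcol]]
  # Unavailable IDs are NA or empty strings
  unavailable <- is.na(ID) | ID == ""
  message(sum(unavailable), " unavailable")
  keep <- !unavailable
  if (!is.null(up2)) {
    # Unquantified proteins have NA values of up2
    unquantified <- keep & is.na(up2)
    message(sum(unquantified), " unquantified")
    keep <- keep & !unquantified
    # Ambiguous proteins are both up and down for the same ID
    n_dir <- tapply(up2[keep], ID[keep], function(x) length(unique(x)))
    ambiguous <- keep & ID %in% names(n_dir)[n_dir > 1]
    message(sum(ambiguous), " ambiguous")
    keep <- keep & !ambiguous
  }
  # Duplicated IDs are counted only among the rows still kept
  dup <- rep(FALSE, nrow(dat))
  dup[keep] <- duplicated(ID[keep])
  message(sum(dup), " duplicated")
  dat[keep & !dup, , drop = FALSE]
}

# Toy data (made up for illustration)
toy <- data.frame(Entry = c("P1", NA, "P2", "P2", "P3", "P3", "P4", ""),
                  stringsAsFactors = FALSE)
toy_up2 <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, NA, FALSE)
cleaned <- cleanup_sketch(toy, "Entry", toy_up2)
```

In this toy case only the rows for P1 and the first P3 remain.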

get_comptab computes differences of chemical metrics between groups of up- and down-regulated proteins.

diffplot makes a plot with points showing the differences between up- and down-regulated proteins for two chemical metrics, as calculated by get_comptab.

qdist makes a quantile distribution plot with lines for both up- and down-regulated proteins. The variable (vars) can be ‘Zc’, ‘nH2O’, or both (two plots are made for the latter). The horizontal axis is the variable and the vertical axis is the quantile point. A solid black line is drawn for the down-regulated proteins, and a dashed red line for the up-regulated proteins. The median difference is shown by a gray horizontal line drawn between the distributions at the 0.5 quantile point.
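The layout of such a plot can be sketched with base graphics. This is only an illustration of the description above: quantile_plot_sketch is a hypothetical function, the values are simulated, and using the empirical CDF for the quantile points is an assumption.

```r
# Hypothetical sketch of a quantile distribution plot (not the package code)
quantile_plot_sketch <- function(down, up, xlab = "Zc") {
  xd <- sort(down)
  xu <- sort(up)
  # Quantile point of each sorted value, from the empirical CDF
  qd <- ecdf(down)(xd)
  qu <- ecdf(up)(xu)
  # Solid black line for down-regulated proteins
  plot(xd, qd, type = "l", col = "black", xlim = range(down, up),
       ylim = c(0, 1), xlab = xlab, ylab = "Quantile point")
  # Dashed red line for up-regulated proteins
  lines(xu, qu, lty = 2, col = "red")
  # Gray horizontal line between the two medians at the 0.5 quantile point
  med <- c(median(down), median(up))
  lines(med, c(0.5, 0.5), col = "gray", lwd = 2)
  invisible(diff(med))
}

# Simulated values standing in for a chemical metric of two protein groups
set.seed(1)
down <- rnorm(50, mean = -0.15, sd = 0.05)
up <- rnorm(60, mean = -0.12, sd = 0.05)
```

Calling quantile_plot_sketch(down, up) draws the plot and invisibly returns the median difference (up minus down).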

xsummary makes an HTML table summarizing chemical differences using xtable. Bold and underline formatting is used to highlight significant chemical differences. The p-value is bolded if it is less than 0.05, and the percent common language effect size (CLES) is bolded if it is <= 40 or >= 60. The mean (or median) difference is underlined if only one of the p-value and CLES passes its cutoff, and bolded if both do. The generated table is written to the console, and can be used in a vignette using the results = "asis" chunk option.
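The highlighting rule for the difference column can be written as a small lookup. This sketch only restates the cutoffs given above; highlight_sketch is a hypothetical helper and does not produce the HTML that xsummary writes.

```r
# Hypothetical helper: which formatting applies to the mean or median difference
highlight_sketch <- function(p.value, CLES) {
  p_ok <- p.value < 0.05
  cles_ok <- CLES <= 40 | CLES >= 60
  # 0 cutoffs passed: none; 1: underline; 2: bold
  c("none", "underline", "bold")[p_ok + cles_ok + 1]
}

fmt <- highlight_sketch(p.value = c(0.2, 0.01, 0.01), CLES = c(55, 55, 65))
```

Here the three made-up datasets get no highlighting, an underline, and bold, respectively.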

xsummary2 is an updated version that is used in the current vignettes in the package. It shows negative numbers in bold (p-value and CLES are not shown). xsummary3 is a further revision that shows GRAVY and pI; it is used in the ‘⁠osmotic_bact⁠’ and ‘⁠osmotic_halo⁠’ vignettes.

Value

For check_IDs, dat is returned with possibly changed values in the column designated by IDcol; old IDs are replaced with new ones, the first known ID for each protein is kept, then proteins with no known IDs are assigned NA.
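The update-then-select logic for semicolon-separated IDs can be sketched as follows. first_known_sketch, the vector of known IDs, and the OLDID1 to NEWID1 mapping are all hypothetical; the package consults its own data files for both lookups.

```r
# Hypothetical sketch: update old IDs, then keep the first known ID per protein
first_known_sketch <- function(IDs, known, updates = character()) {
  vapply(strsplit(IDs, ";", fixed = TRUE), function(x) {
    # Replace old IDs with new ones (updates is a named vector: old = "new")
    is_old <- x %in% names(updates)
    x[is_old] <- updates[x[is_old]]
    # Keep the first ID found among the known IDs, or NA if there is none
    hit <- x[x %in% known]
    if (length(hit) > 0) hit[1] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

known <- c("P61247", "P46777", "P60174", "NEWID1")  # assumed lookup table
updates <- c(OLDID1 = "NEWID1")                     # made-up old-to-new mapping
IDs <- c("P61247;PXXXXX", "PYYYYY;P46777;P60174", "PZZZZZ", "OLDID1")
first_IDs <- first_known_sketch(IDs, known, updates)
```

The third protein has no known ID and is assigned NA; the fourth is updated to the new ID.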

For get_comptab, a data frame is returned invisibly containing the columns ‘dataset’, ‘description’, ‘n1’ (number of down-regulated proteins), ‘n2’ (number of up-regulated proteins), followed by two sets of columns for the variables. These are denoted generically as (‘var.mfun1’, ‘var.mfun2’, ‘var.diff’, ‘var.CLES’, ‘var.p.value’), where ‘var’ is replaced by the name of var1 or var2, and ‘mfun’ is replaced by the value of mfun. For example, ‘Zc.median1’ and ‘Zc.median2’ are the median Zc of the down- and up-regulated proteins, respectively.
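The column naming can be illustrated for a single variable. This is a sketch under stated assumptions, not the package code: comptab_row_sketch is hypothetical, the difference is taken as up minus down, CLES is computed as the percentage of (up, down) pairs in which the up value is larger, and the p-value comes from a Wilcoxon rank-sum test.

```r
# Hypothetical sketch of one set of columns for one dataset
comptab_row_sketch <- function(values, up2, var = "Zc", mfun = "median") {
  f <- match.fun(mfun)
  x1 <- values[!up2]  # down-regulated proteins
  x2 <- values[up2]   # up-regulated proteins
  # Assumption: empirical CLES in percent (share of pairs with up > down)
  cles <- 100 * mean(outer(x2, x1, ">"))
  # Assumption: two-sided Wilcoxon rank-sum test
  p <- wilcox.test(x1, x2)$p.value
  out <- data.frame(length(x1), length(x2), f(x1), f(x2), f(x2) - f(x1), cles, p)
  names(out) <- c("n1", "n2", paste0(var, ".", mfun, 1:2),
                  paste0(var, c(".diff", ".CLES", ".p.value")))
  out
}

# Made-up values of Zc for three down- and four up-regulated proteins
values <- c(-0.20, -0.15, -0.10, -0.12, -0.05, -0.02, 0.01)
up2 <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
row <- comptab_row_sketch(values, up2)
```

With mfun = "mean" the same call would produce columns named ‘Zc.mean1’ and ‘Zc.mean2’.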

For xsummary, (invisibly) the data frame used to make the table; this data frame differs from comptab by having row names added (alphabetical one-letter IDs for the datasets).

Plot style

The overall style of the plot is controlled by oldstyle.

oldstyle = FALSE

This is the current default style. Use pch and cex to control the point symbol and size. Contours are added for confidence regions of highest probability density, computed using a 2-D kernel density estimate (kde2d). probs gives the probability level(s) and col.contour sets the color(s) of the contour lines. contour can be a logical vector, indicating which points to include; set it to FALSE to omit the contour lines.

The code to calculate the contour levels is modified from HPDregionplot in the emdbook package by Ben Bolker (https://cran.r-project.org/package=emdbook).
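A minimal sketch of how a contour level for a highest-density region can be obtained from a kde2d estimate, following the general approach of HPDregionplot. hpd_level_sketch and the simulated points are hypothetical, and the grid size and default bandwidth are assumptions rather than the settings used by diffplot.

```r
library(MASS)  # provides kde2d

# Hypothetical sketch: density level enclosing a given probability mass
hpd_level_sketch <- function(x, y, probs = 0.5, n = 50) {
  dens <- kde2d(x, y, n = n)
  # Area of one grid cell
  dx <- diff(dens$x[1:2])
  dy <- diff(dens$y[1:2])
  # Accumulate probability mass starting from the lowest densities
  sz <- sort(dens$z)
  c1 <- cumsum(sz) * dx * dy
  # The level for probability p leaves mass 1 - p below it
  levels <- sapply(probs, function(p) approx(c1, sz, xout = 1 - p)$y)
  list(dens = dens, levels = levels)
}

# Simulated points standing in for differences of two chemical metrics
set.seed(42)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200, sd = 0.5)
res <- hpd_level_sketch(x, y, probs = 0.5)
```

The contour could then be drawn with contour(res$dens, levels = res$levels, drawlabels = FALSE).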

oldstyle = TRUE

This style was used for the historical (2017) vignettes, which have been moved to the ‘extdata/cpcp’ directory in JMDplots (https://github.com/jedick/JMDplots). For each dataset, the point symbol is a filled square if the p-values of both the x-variable and y-variable are less than 0.05, a filled circle if the p-value of one of the x- or y-variables is less than 0.05, and an open circle otherwise. A solid line is drawn from the point to the corresponding axis if the rounded, absolute value of (CLES in percent - 50) of the x- or y-variable is greater than or equal to 10. Otherwise, a dashed line is drawn from the point to the corresponding axis if the p-value of the x- or y-variable is less than 0.05. Otherwise, no line is drawn.
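The symbol and line rules above can be expressed as a short decision function. This is a sketch only: oldstyle_symbols_sketch is hypothetical, and the pch codes (15 for a filled square, 19 for a filled circle, 1 for an open circle) are the usual base graphics values, assumed here rather than taken from the package.

```r
# Hypothetical sketch of the old-style symbol and line choices
oldstyle_symbols_sketch <- function(p.x, p.y, CLES.x, CLES.y) {
  # Number of variables (0, 1 or 2) with p-value below 0.05
  n_sig <- (p.x < 0.05) + (p.y < 0.05)
  pch <- c(1, 19, 15)[n_sig + 1]
  # Line to the axis: solid for large effect size, else dashed if p < 0.05
  line_type <- function(p, CLES) {
    ifelse(abs(round(CLES - 50)) >= 10, "solid",
           ifelse(p < 0.05, "dashed", "none"))
  }
  data.frame(pch = pch, lty.x = line_type(p.x, CLES.x),
             lty.y = line_type(p.y, CLES.y), stringsAsFactors = FALSE)
}

# Made-up p-values and percent CLES for three datasets
sym <- oldstyle_symbols_sketch(p.x = c(0.01, 0.01, 0.5), p.y = c(0.01, 0.5, 0.5),
                               CLES.x = c(65, 55, 52), CLES.y = c(38, 45, 58))
```

The three made-up datasets get a filled square, a filled circle, and an open circle, respectively.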

References

Dick, J. M., LaRowe, D. E. and Helgeson, H. C. (2006) Temperature, pressure, and electrochemical constraints on protein speciation: Group additivity calculation of the standard molal thermodynamic properties of ionized unfolded proteins. Biogeosciences 3, 311–336. doi:10.5194/bg-3-311-2006

Jimenez, C. R., Knol, J. C., Meijer, G. A. and Fijneman, R. J. A. (2010) Proteomics of colorectal cancer: Overview of discovery studies and identification of commonly identified cancer-associated proteins and candidate CRC serum markers. J. Proteomics 73, 1873–1895. doi:10.1016/j.jprot.2010.06.004

See Also

pdat_

Examples

## Not run: 
mkvig("osmotic_gene")

## End(Not run)

# Synthetic data to show actions for incorrect IDs
ID <- c("P61247;PXXXXX", "PYYYYY;P46777;P60174", "PZZZZZ")
dat <- data.frame(ID = ID, stringsAsFactors = FALSE)
# Get the first known ID for each protein; the third one is NA
check_IDs(dat, "ID")
# Update an old ID
dat <- data.frame(Entry = "P50224", stringsAsFactors = FALSE)
check_IDs(dat, "Entry")

# Set up a workflow to clean data and retrieve amino acid compositions
extdatadir <- system.file("extdata", package = "JMDplots")
datadir <- paste0(extdatadir, "/diffexpr/pancreatic/")
dataset <- "CYD+05"
dat <- read.csv(paste0(datadir, dataset, ".csv.xz"), as.is = TRUE)
up2 <- dat$Ratio..cancer.normal. > 1
# Remove two unavailable and one duplicated proteins
dat <- cleanup(dat, "Entry", up2)
# Now we can retrieve the amino acid compositions
aa <- canprot::human_aa(dat$Entry)

# Read another data file
datadir <- paste0(system.file("extdata", package = "JMDplots"), "/diffexpr/colorectal/")
dataset <- "STK+15"
dat <- read.csv(paste0(datadir, dataset, ".csv.xz"), as.is = TRUE)
# Remove unavailable proteins
dat <- cleanup(dat, "uniprot")
# Remove proteins that have less than 2-fold expression ratio
dat <- cleanup(dat, abs(log2(dat$invratio)) < 1)

# Analysis of differential expression data
pd <- pdat_colorectal("JKMF10")
# Default variables: Zc and nH2O
get_comptab(pd, plot.it = TRUE)
# Protein length and per-residue volume
get_comptab(pd, "nAA", "V0", plot.it = TRUE)

# Make an old-style plot for two datasets
comptab <- lapply(c("JKMF10", "WDO+15_C.N"), function(dataset) {
  pdat <- pdat_colorectal(dataset)
  get_comptab(pdat, oldstyle = TRUE)
})
diffplot(comptab, oldstyle = TRUE)

# Plot the data of Jimenez et al., 2010 for colorectal cancer
pdat <- pdat_colorectal("JKMF10")
qdist(pdat)

# Making a table
comptab <- lapply(c("JKMF10", "WDO+15_C.N"), function(dataset) {
  pdat <- pdat_colorectal(dataset)
  get_comptab(pdat, oldstyle = TRUE)
})
xsummary(comptab)


[Package JMDplots version 1.2.19-14]