canprot {JMDplots} | R Documentation |
These functions were moved from the canprot package in February 2024.
mkvig(vig = NULL)
check_IDs(dat, IDcol, aa_file = NULL, updates_file = NULL)
cleanup(dat, IDcol, up2 = NULL)
get_comptab(pdat, var1 = "Zc", var2 = "nH2O", plot.it = FALSE,
mfun = "median", oldstyle = FALSE)
diffplot(comptab, vars = c("Zc", "nH2O"), col = "black", plot.rect = FALSE,
pt.text = c(letters, LETTERS), cex.text = 0.85, oldstyle = FALSE,
pch = 1, cex = 2.1, contour = TRUE, col.contour = par("fg"),
probs = 0.5, add = FALSE, labtext = NULL, ...)
qdist(pdat, vars = c("Zc", "nH2O"), show.steps = FALSE)
xsummary(comptab, vars = c("Zc", "nH2O"))
xsummary2(comptab1, comptab2)
xsummary3(comptab1, comptab2, comptab3)
vig |
character, name of a vignette without ‘.Rmd’ extension |
dat |
data frame, protein expression data |
IDcol |
character, name of column that has the UniProt IDs |
aa_file |
character, name of file with additional amino acid compositions |
updates_file |
character, name of file with old to new ID mappings |
up2 |
logical, TRUE for up-regulated proteins, FALSE for down-regulated proteins |
pdat |
list, data object generated by a |
var1 |
character, the first variable |
var2 |
character, the second variable |
plot.it |
logical, make a scatterplot? |
mfun |
character, either ‘median’ or ‘mean’ |
oldstyle |
logical, also calculate |
comptab |
list or data frame, output of |
vars |
character, which variables (chemical metrics) to calculate or plot |
col |
character or numeric, color(s) for the points |
plot.rect |
logical, plot a reference rectangle? |
pt.text |
character, text labels for the points |
cex.text |
numeric, size of text labels |
pch |
numeric, point symbol |
cex |
numeric, point size |
contour |
logical, add contour lines? |
col.contour |
character or numeric, color of contour lines |
probs |
numeric, probability level(s) for contours |
add |
logical, add to an existing plot? |
labtext |
character, text to add to axis labels |
... |
other arguments passed to
show.steps |
logical, show the steps using |
comptab1 |
list, output of |
comptab2 |
list, output of |
comptab3 |
list, output of |
These functions for processing differential expression datasets for the gradH2O and canH2O papers were previously in the canprot package.
mkvig
compiles the indicated vignette for chemical analysis of differential expression datasets and opens it in the browser.
Pandoc (including pandoc-citeproc), as a system dependency of rmarkdown, must be installed.
See rmarkdown's ‘pandoc’ vignette for installation tips.
The vignettes can also be run using e.g. demo("HPA")
, and through the interactive help system (help.start
> Packages > JMDplots > Code demos).
The available vignettes are listed here:
Cell culture – ‘hypoxia’, ‘secreted’, ‘osmotic_bact’, ‘osmotic_euk’, ‘osmotic_halo’, ‘glucose’, ‘3D’, ‘osmotic_gene’, ‘yeast_stress’
Cancer – ‘breast’, ‘colorectal’, ‘liver’, ‘lung’, ‘pancreatic’, ‘prostate’
Pan-cancer – ‘TCGA’, ‘HPA’
check_IDs
is used to check for known UniProt IDs and to update obsolete IDs.
The source IDs should be provided in the IDcol
column of dat
; multiple IDs for one protein can be separated by a semicolon.
The function keeps the first “known” ID for each protein, which must be present in one of these groups:
The human.aa
dataset of amino acid compositions.
Old UniProt IDs that are mapped to new UniProt IDs in ‘extdata/diffexpr/uniprot_updates.csv’ or in updates_file
if specified.
IDs of proteins in aa_file
, which lists amino acid compositions in the format described for human.aa
(see thermo$protein
for details).
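The selection rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: first_known_id, known, and updates are hypothetical names, and the IDs in the example are made up.

```r
# Hypothetical helper (not part of JMDplots): for each semicolon-separated
# entry, keep the first ID that is known, updating obsolete IDs on the way.
first_known_id <- function(ids, known, updates = character()) {
  vapply(strsplit(ids, ";", fixed = TRUE), function(candidates) {
    # Old IDs are the names of the 'updates' vector; its values are new IDs
    is_old <- candidates %in% names(updates)
    # An ID counts as known if it is in 'known' or can be updated
    is_known <- is_old | candidates %in% known
    candidates[is_old] <- updates[candidates[is_old]]
    hit <- candidates[is_known]
    # NA is assigned when none of the candidates is known
    if (length(hit) > 0) hit[1] else NA_character_
  }, character(1))
}

# Made-up inputs for illustration only
known <- c("P61247", "P46777", "P60174")
updates <- c(OLD001 = "NEW001")
ids <- c("P61247;PXXXXX", "PYYYYY;P46777;P60174", "PZZZZZ", "OLD001")
result <- first_known_id(ids, known, updates)
print(result)
```

In this sketch the set of known IDs stands in for the three groups listed above (human.aa, the update mappings, and aa_file).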
cleanup
removes proteins with unavailable IDs, ambiguous expression ratios, and duplicated IDs.
IDcol
is the name of the column that has the UniProt IDs, and up2
indicates the expression change for each protein.
The function removes proteins with unavailable (NA or "") or duplicated IDs.
If up2
is provided, the function also removes unquantified proteins (those that have NA values of up2
) and those with ambiguous expression ratios (up and down for the same ID).
For each operation, a message is printed describing the number of proteins that are ‘unavailable’, ‘unquantified’, ‘ambiguous’, or ‘duplicated’.
Alternatively, if IDcol
is a logical value, it selects proteins to be unconditionally removed.
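A minimal base R sketch of the filtering steps described above is given here. It assumes the steps are applied in the order unavailable, unquantified, ambiguous, duplicated; cleanup_sketch is a hypothetical stand-in and the data are made up, so the details may differ from what cleanup actually does.

```r
# Hypothetical stand-in for the filtering steps (not the package function)
cleanup_sketch <- function(dat, IDcol, up2 = NULL) {
  # 1. Unavailable IDs (NA or "")
  ids <- dat[[IDcol]]
  unavailable <- is.na(ids) | ids == ""
  message(sum(unavailable), " unavailable")
  dat <- dat[!unavailable, , drop = FALSE]
  if (!is.null(up2)) {
    up2 <- up2[!unavailable]
    # 2. Unquantified proteins (NA values of up2)
    unquantified <- is.na(up2)
    message(sum(unquantified), " unquantified")
    dat <- dat[!unquantified, , drop = FALSE]
    up2 <- up2[!unquantified]
    # 3. Ambiguous expression: the same ID is both up and down
    ids <- dat[[IDcol]]
    ambiguous <- ids %in% intersect(ids[up2], ids[!up2])
    message(sum(ambiguous), " ambiguous")
    dat <- dat[!ambiguous, , drop = FALSE]
  }
  # 4. Duplicated IDs (the first occurrence is kept)
  is_dup <- duplicated(dat[[IDcol]])
  message(sum(is_dup), " duplicated")
  dat[!is_dup, , drop = FALSE]
}

# Made-up data for illustration only
dat <- data.frame(ID = c("A", NA, "", "B", "B", "C", "C", "D"),
                  stringsAsFactors = FALSE)
up2 <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, NA)
cleaned <- cleanup_sketch(dat, "ID", up2)
print(cleaned$ID)
```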
get_comptab
computes differences of chemical metrics between groups of up- and down-regulated proteins.
Differentially expressed proteins are identified by the value of pdat$up2
(TRUE for up-regulated proteins and FALSE for down-regulated proteins).
The differences are calculated as (median for up-regulated proteins) - (median for down-regulated proteins).
If mfun
is ‘mean’, means of the groups are used instead.
If oldstyle
is TRUE, the function also calculates the common language effect size (CLES
, in percent) and p-value for each variable.
Volume is calculated using amino acid group additivity as described by Dick et al. (2006).
Set plot.it
to TRUE
to make a scatterplot.
Open red squares and filled blue circles stand for up-regulated and down-regulated proteins, respectively.
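The difference between groups and the old-style statistics can be illustrated with made-up numbers. In this sketch the CLES is taken to be the percentage of (up, down) pairs in which the up-regulated protein has the larger value, and the p-value comes from a Wilcoxon rank-sum test; both choices are assumptions made for illustration and are not confirmed by this page.

```r
set.seed(1)
# Made-up values of one chemical metric for 40 down- and 60 up-regulated proteins
zc <- c(rnorm(40, mean = -0.15, sd = 0.05), rnorm(60, mean = -0.12, sd = 0.05))
up2 <- rep(c(FALSE, TRUE), times = c(40, 60))

# Difference = (median for up-regulated) - (median for down-regulated);
# set mfun to "mean" to use means instead
mfun <- "median"
m1 <- do.call(mfun, list(zc[!up2]))  # down-regulated group
m2 <- do.call(mfun, list(zc[up2]))   # up-regulated group
zc_diff <- m2 - m1

# Assumed definition of CLES (in percent): share of all (up, down) pairs
# in which the value for the up-regulated protein is larger
pair_diffs <- outer(zc[up2], zc[!up2], "-")
cles <- 100 * mean(pair_diffs > 0)

# Assumed test for the p-value
p_value <- wilcox.test(zc[up2], zc[!up2])$p.value

print(c(diff = zc_diff, CLES = cles, p.value = p_value))
```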
diffplot
makes a plot with points showing the differences between up- and down-regulated proteins for two chemical metrics, as calculated by get_comptab
.
The default setting of vars
refers to average oxidation state of carbon (ZC) as the x-variable and stoichiometric hydration state (nH2O) as the y-variable.
The colors of the points are controlled by col
, which is recycled to be equal to the number of comparisons in comptab
.
If plot.rect
is TRUE, a shaded rectangle is drawn with coordinates -0.01, -0.01, 0.01, 0.01.
This is useful for visualizing the different scales of multi-panel plots.
If pt.text
is not NA or FALSE, text
labels are added with size controlled by cex.text
.
The default value produces labels that are taken sequentially from the 26 lowercase Roman letters in alphabetical order (letters
), followed by the set of uppercase letters (LETTERS
).
For labtext
= NULL, descriptive text (“median difference” or “mean difference”) is added to the axis labels in parentheses.
This text can be changed by giving a value in labtext
(for both axes), two values (for each axis), or NA to suppress the text.
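The handling of colors, point labels, and the reference rectangle can be mimicked with base graphics. The differences below are made up, and the axis label wording is only an approximation of what diffplot produces.

```r
# Made-up differences for five hypothetical comparisons
dZc <- c(0.004, -0.012, 0.020, -0.003, 0.008)
dnH2O <- c(-0.006, 0.015, -0.021, 0.002, 0.011)
n <- length(dZc)

# Recycle the colors to the number of comparisons and
# take the labels sequentially from the letters
col <- c("black", "red")
point_col <- rep(col, length.out = n)
point_lab <- c(letters, LETTERS)[seq_len(n)]

# Descriptive text added to the axis labels in parentheses
labtext <- "median difference"
xlab <- paste0("Zc (", labtext, ")")
ylab <- paste0("nH2O (", labtext, ")")

# Empty plot, then reference rectangle, points, and text labels
plot(dZc, dnH2O, type = "n", xlab = xlab, ylab = ylab)
rect(-0.01, -0.01, 0.01, 0.01, col = "grey80", border = NA)
points(dZc, dnH2O, pch = 1, cex = 2.1, col = point_col)
text(dZc, dnH2O, point_lab, cex = 0.85)
```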
cplab
is a list of formatted labels used by diffplot
.
It is an exported object, available to the user and other packages.
qdist
makes a quantile distribution plot with lines for both up- and down-regulated proteins.
The variables (vars
) can be ‘Zc’, ‘nH2O’, or both (two plots are made for the latter).
The horizontal axis is the variable and the vertical axis is the quantile point.
A solid black line is drawn for the down-regulated proteins, and a dashed red line for the up-regulated proteins.
The median difference is shown by a gray horizontal line drawn between the distributions at the 0.5 quantile point.
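A plot of this kind can be approximated in base R as follows. The values are made up, and the use of ppoints for the quantile points is an assumption; qdist may compute them differently.

```r
set.seed(2)
# Made-up values of one variable for down- and up-regulated proteins
zc_down <- rnorm(50, mean = -0.15, sd = 0.05)
zc_up <- rnorm(70, mean = -0.12, sd = 0.05)

# Assumed quantile points: evenly spaced probabilities between 0 and 1
q_down <- ppoints(length(zc_down))
q_up <- ppoints(length(zc_up))

# Variable on the horizontal axis, quantile point on the vertical axis
plot(range(zc_down, zc_up), c(0, 1), type = "n",
     xlab = "Zc", ylab = "quantile point")
lines(sort(zc_down), q_down, col = "black", lty = 1)  # solid black: down
lines(sort(zc_up), q_up, col = "red", lty = 2)        # dashed red: up

# Gray horizontal line between the two medians at the 0.5 quantile point
med <- c(median(zc_down), median(zc_up))
lines(med, c(0.5, 0.5), col = "gray", lwd = 2)
```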
xsummary
makes an HTML table summarizing chemical differences using xtable
.
Bold and underline formatting is used to highlight significant chemical differences.
The p-value is bolded if it is less than 0.05, and the percent common language effect size (CLES
) is bolded if it is <= 40 or >= 60.
The mean (or median) difference is underlined if only one of the p-value and CLES passes its cutoff, and bolded if both do.
The generated table is written to the console, and can be used in a vignette using the results = "asis"
chunk option.
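The highlighting rule for the difference can be written as a small decision function. highlight_diff is a hypothetical helper used only to illustrate the cutoffs stated above; it does not produce HTML.

```r
# Hypothetical helper: choose the formatting of the mean (or median) difference
# from the p-value and the CLES (in percent)
highlight_diff <- function(p_value, cles) {
  p_pass <- p_value < 0.05
  cles_pass <- cles <= 40 | cles >= 60
  # 0 cutoffs passed: no formatting; 1: underline; 2: bold
  c("none", "underline", "bold")[p_pass + cles_pass + 1]
}

fmt <- highlight_diff(p_value = c(0.20, 0.01, 0.20, 0.01),
                      cles = c(55, 55, 38, 65))
print(fmt)
```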
xsummary2
is an updated version that is used in the current vignettes in the package.
It shows negative numbers in bold (p-value and CLES are not shown).
xsummary3
is a further revision that shows GRAVY and pI; it is used in the ‘osmotic_bact’ and ‘osmotic_halo’ vignettes.
For check_IDs
, dat
is returned with possibly changed values in the column designated by IDcol
; old IDs are replaced with new ones, the first known ID for each protein is kept, then proteins with no known IDs are assigned NA
.
For get_comptab
, a data frame is returned invisibly containing the columns ‘dataset’, ‘description’, ‘n1’ (number of down-regulated proteins), ‘n2’ (number of up-regulated proteins), followed by two sets of columns for the variables.
These are denoted generically as (‘var.mfun1’, ‘var.mfun2’, ‘var.diff’, ‘var.CLES’, ‘var.p.value’), where ‘var’ is replaced by the name of var1
or var2
, and ‘mfun’ is replaced by the value of mfun
.
For example, ‘Zc.median1’ and ‘Zc.median2’ are the median ZC of the down- and up-regulated proteins, respectively.
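The naming pattern can be spelled out with paste. comptab_names is a hypothetical helper that only builds the generic column names listed above; which of these columns are present in a particular result is not shown here.

```r
# Hypothetical helper: build the generic column names for two variables
comptab_names <- function(var1 = "Zc", var2 = "nH2O", mfun = "median") {
  # e.g. median1, median2, diff, CLES, p.value
  stats <- c(paste0(mfun, 1:2), "diff", "CLES", "p.value")
  c("dataset", "description", "n1", "n2",
    paste(var1, stats, sep = "."),
    paste(var2, stats, sep = "."))
}

cn <- comptab_names()
print(cn)
```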
For xsummary
, the data frame used to make the table is returned invisibly; this data frame differs from comptab
by having row names added (alphabetical one-letter IDs for the datasets).
The overall style of the plot is controlled by oldstyle
.
oldstyle = FALSE
This is the current default style.
Use pch
and cex
to control the point symbol and size.
Contours are added for confidence regions of highest probability density, computed using a 2-D kernel density estimate (kde2d
).
probs
gives the probability level(s) and col.contour
sets the color(s) of the contour lines.
contour
can be a logical vector, indicating which points to include; set it to FALSE to omit the contour lines.
The code to calculate the contour levels is modified from HPDregionplot
in the emdbook package by Ben Bolker (https://cran.r-project.org/package=emdbook).
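The idea behind the contour levels can be sketched with kde2d from the MASS package: sort the density values on the grid, accumulate probability mass from the lowest density upward, and find the density at which the excluded mass equals 1 - probs. This follows the general approach of HPDregionplot, but it is a simplified illustration with made-up data rather than the code used in diffplot.

```r
library(MASS)  # provides kde2d()

set.seed(3)
# Made-up differences for 200 hypothetical datasets
x <- rnorm(200)
y <- 0.5 * x + rnorm(200, sd = 0.5)
probs <- 0.5

# 2-D kernel density estimate on a regular grid
dens <- kde2d(x, y, n = 50)
dx <- diff(dens$x[1:2])
dy <- diff(dens$y[1:2])

# Cumulative probability mass, starting from the lowest density values
sz <- sort(dens$z)
c1 <- cumsum(sz) * dx * dy

# Density level that leaves out a mass of 1 - probs
levels <- sapply(probs, function(p) approx(c1, sz, xout = 1 - p)$y)

# Draw the contour of the highest-density region
contour(dens, levels = levels, drawlabels = FALSE)
points(x, y, pch = 1, cex = 0.5)
```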
oldstyle = TRUE
This style was used for the historical (2017) vignettes, which have been moved to the ‘extdata/cpcp’ directory in JMDplots (https://github.com/jedick/JMDplots).
For each dataset, the point symbol is a filled square if the p-values of both the x-variable and y-variable are less than 0.05, a filled circle if the p-value of one of the x- or y-variables is less than 0.05, and an open circle otherwise.
A solid line is drawn from the point to the corresponding axis if the rounded, absolute value of (CLES
in percent - 50) of the x- or y-variable is greater than or equal to 10.
Otherwise, a dashed line is drawn from the point to the corresponding axis if the p-value of the x- or y-variable is less than 0.05.
Otherwise, no line is drawn.
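The old-style rules for symbols and lines amount to two small decisions per dataset. The helpers below are hypothetical and use the standard base R symbol codes (15 for a filled square, 19 for a filled circle, 1 for an open circle); the codes used by diffplot itself are not stated on this page.

```r
# Hypothetical helper: point symbol from the p-values of the x- and y-variables
oldstyle_symbol <- function(p_x, p_y) {
  n_sig <- (p_x < 0.05) + (p_y < 0.05)
  # 0 significant: open circle; 1: filled circle; 2: filled square
  c(1, 19, 15)[n_sig + 1]
}

# Hypothetical helper: type of line drawn from the point to one axis,
# from the CLES (in percent) and p-value of the corresponding variable
oldstyle_line <- function(cles, p) {
  if (round(abs(cles - 50)) >= 10) {
    "solid"
  } else if (p < 0.05) {
    "dashed"
  } else {
    "none"
  }
}

symbols <- oldstyle_symbol(p_x = c(0.01, 0.01, 0.30), p_y = c(0.01, 0.20, 0.20))
line_types <- c(oldstyle_line(61, 0.20), oldstyle_line(55, 0.01), oldstyle_line(55, 0.20))
print(symbols)
print(line_types)
```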
Dick, J. M., LaRowe, D. E. and Helgeson, H. C. (2006) Temperature, pressure, and electrochemical constraints on protein speciation: Group additivity calculation of the standard molal thermodynamic properties of ionized unfolded proteins. Biogeosciences 3, 311–336. doi:10.5194/bg-3-311-2006
Jimenez, C. R., Knol, J. C., Meijer, G. A. and Fijneman, R. J. A. (2010) Proteomics of colorectal cancer: Overview of discovery studies and identification of commonly identified cancer-associated proteins and candidate CRC serum markers. J. Proteomics 73, 1873–1895. doi:10.1016/j.jprot.2010.06.004
## Not run:
mkvig("osmotic_gene")
## End(Not run)
# Synthetic data to show actions for incorrect IDs
ID <- c("P61247;PXXXXX", "PYYYYY;P46777;P60174", "PZZZZZ")
dat <- data.frame(ID = ID, stringsAsFactors = FALSE)
# Get the first known ID for each protein; the third one is NA
check_IDs(dat, "ID")
# Update an old ID
dat <- data.frame(Entry = "P50224", stringsAsFactors = FALSE)
check_IDs(dat, "Entry")
# Set up a workflow to clean data and retrieve amino acid compositions
extdatadir <- system.file("extdata", package = "JMDplots")
datadir <- paste0(extdatadir, "/diffexpr/pancreatic/")
dataset <- "CYD+05"
dat <- read.csv(paste0(datadir, dataset, ".csv.xz"), as.is = TRUE)
up2 <- dat$Ratio..cancer.normal. > 1
# Remove two unavailable and one duplicated proteins
dat <- cleanup(dat, "Entry", up2)
# Now we can retrieve the amino acid compositions
aa <- canprot::human_aa(dat$Entry)
# Read another data file
datadir <- paste0(system.file("extdata", package = "JMDplots"), "/diffexpr/colorectal/")
dataset <- "STK+15"
dat <- read.csv(paste0(datadir, dataset, ".csv.xz"), as.is = TRUE)
# Remove unavailable proteins
dat <- cleanup(dat, "uniprot")
# Remove proteins that have less than 2-fold expression ratio
dat <- cleanup(dat, abs(log2(dat$invratio)) < 1)
# Analysis of differential expression data
pd <- pdat_colorectal("JKMF10")
# Default variables: Zc and nH2O
get_comptab(pd, plot.it = TRUE)
# Protein length and per-residue volume
get_comptab(pd, "nAA", "V0", plot.it = TRUE)
# Make an old-style plot for two datasets
comptab <- lapply(c("JKMF10", "WDO+15_C.N"), function(dataset) {
pdat <- pdat_colorectal(dataset)
get_comptab(pdat, oldstyle = TRUE)
})
diffplot(comptab, oldstyle = TRUE)
# Plot the data of Jimenez et al., 2010 for colorectal cancer
pdat <- pdat_colorectal("JKMF10")
qdist(pdat)
# Making a table
comptab <- lapply(c("JKMF10", "WDO+15_C.N"), function(dataset) {
pdat <- pdat_colorectal(dataset)
get_comptab(pdat, oldstyle = TRUE)
})
xsummary(comptab)