JMDplots-package {JMDplots} | R Documentation |
This package contains data and code used to make the plots in various papers. Each paper is identified with a project name, as listed in the table below. The plots are available in the corresponding vignettes in the package.
chem16S
- Community-level chemical metrics (Dick and Kang, 2023)
orp16S
- Influence of redox potential on bacterial protein evolution (Dick and Meng, 2023)
geo16S
- Chemical links between redox conditions and community reference proteomes (Dick and Tan, 2023)
utogig
- Using thermodynamics to obtain geochemical information from genomes (Dick et al., 2023)
evdevH2O
- Thermodynamic model for water activity and redox potential in evolution and development (Dick, 2022)
mjenergy
- Energy release in protein synthesis (Dick and Shock, 2021)
canH2O
- Water as a reactant in the differential expression of proteins in cancer (Dick, 2021)
gradH2O
- Stoichiometric hydration state of metagenomes in salinity gradients (Dick et al., 2020)
chnosz10
- CHNOSZ, ten years after first CRAN submission (Dick, 2019)
gradox
- Carbon oxidation state of metagenomes in redox gradients (Dick et al., 2019)
cpcp
- Potential diagrams for cancer proteomes (Dick, 2016, 2017)
aoscp
- Average oxidation state of carbon in proteins (Dick, 2014)
bison
- Bison Pool hot spring (Dick and Shock, 2011, 2013)
scsc
- Subcellular locations of Saccharomyces cerevisiae (Dick, 2009)
aaaq
- Amino acid group additivity for ionized proteins (Dick et al., 2006)
These names are used for the vignettes and functions; the function names have figure numbers appended, as in gradox1.
Data for each of the papers are stored in the corresponding directories under extdata; see the documentation page for each paper for more details.
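The following is a minimal sketch of how the extdata directory of an installed copy of the package might be located from R. It uses only base R; the assumption is that the package is installed under the name JMDplots, and the code falls back to a message when it is not.

```r
# Locate the extdata directory of an installed package.
# system.file() returns "" when the package or the path is not found.
extdata_dir <- system.file("extdata", package = "JMDplots")

if (nzchar(extdata_dir)) {
  # List the top-level directories (per-paper data and the other directories)
  print(list.files(extdata_dir))
} else {
  message("JMDplots is not installed; nothing to list")
}
```

Files inside a given directory can then be addressed with file.path(extdata_dir, ...).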
There are some other directories, described below.
subsurface
Metagenome-derived amino acid compositions for two subsurface environments.
These files are provided to illustrate the usage of user-provided data sets; see subsurface.
RefDB/organisms
Sce.csv.xz
Data frame of amino acid composition of 6716 proteins from the Saccharomyces Genome Database (SGD).
Values in the first three columns are the ORF names of proteins, SGDID, and GENE names.
The remaining twenty columns (ALA..VAL) contain the numbers of the respective amino acids in each protein.
The sources of data for ‘Sce.csv’ are the files ‘protein_properties.tab’ and ‘SGD_features.tab’ (for the gene names), downloaded from https://www.yeastgenome.org/ on 2013-08-24.
A shorter version of this file was previously present in CHNOSZ (up to version 1.3.3).
Used in yeast.aa.
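As an illustration of this layout, the sketch below builds a small stand-in file with the same column structure (three identifier columns followed by twenty amino acid columns), writes it as xz-compressed CSV, and reads it back. The protein identifiers and counts are invented, and the intermediate column names between ALA and VAL are assumed to be the standard three-letter abbreviations; only base R is used.

```r
# Standard three-letter amino acid abbreviations (assumed column names and order)
aa_names <- c("ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
              "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL")

# Toy stand-in for the real file: two made-up proteins with invented counts
ids <- data.frame(ORF = c("YXX001W", "YXX002C"),
                  SGDID = c("S000000001", "S000000002"),
                  GENE = c("AAA1", "BBB2"),
                  stringsAsFactors = FALSE)
counts <- matrix(c(1:20, 20:1), nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, aa_names))
toy <- cbind(ids, as.data.frame(counts))

# Write an xz-compressed CSV through a connection
toy_file <- tempfile(fileext = ".csv.xz")
con <- xzfile(toy_file, "w")
write.csv(toy, con, row.names = FALSE, quote = FALSE)
close(con)

# read.csv() detects the compression and decompresses while reading
aa <- read.csv(toy_file, stringsAsFactors = FALSE)

# Columns 1-3 are identifiers; columns 4-23 hold amino acid counts,
# so their row sums give the protein lengths
protein_length <- rowSums(aa[, 4:23])
names(protein_length) <- aa$ORF
print(protein_length)
```

The same read.csv() call should work on the packaged file once its path is known.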
UP000000805_243232.csv.xz
Amino acid compositions of 1787 proteins in Methanocaldococcus jannaschii.
It was created by processing the file UP000000805_243232.fasta.gz, which was downloaded from the UniProt reference proteomes FTP site (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Archaea/).
The server timestamp on UP000000805_243232.fasta.gz was December 2, 2020; the reference proteome was last modified on August 22, 2020 according to https://www.uniprot.org/proteomes/UP000000805.
The commands used to create UP000000805_243232.csv.xz are:
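# Read the amino acid compositions of the proteins in the FASTA file
aa <- canprot::read_fasta("UP000000805_243232.fasta.gz")
write.csv(aa, "UP000000805_243232.csv", row.names = FALSE, quote = FALSE)
# Compress with the external xz program, producing UP000000805_243232.csv.xz
system("xz UP000000805_243232.csv")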
Used in mjenergy.
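To make "amino acid composition" concrete, here is a minimal base R sketch that counts the residues in a single protein sequence. It is not the implementation used by canprot::read_fasta; the function name count_aa and the example sequence are hypothetical.

```r
# One-letter codes in the order of the three-letter names Ala..Val
one_letter <- c("A", "R", "N", "D", "C", "Q", "E", "G", "H", "I",
                "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Hypothetical helper: count each standard amino acid in one sequence string
count_aa <- function(sequence) {
  residues <- strsplit(toupper(sequence), "")[[1]]
  # Fixed factor levels keep zero counts and ignore non-standard letters
  counts <- as.integer(table(factor(residues, levels = one_letter)))
  names(counts) <- one_letter
  counts
}

composition <- count_aa("MKVLAAGIVGL")
print(composition[composition > 0])
```

Applying such a count to every sequence in a FASTA file and stacking the results row by row gives a table with one row per protein and one column per amino acid.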
UP000000625_83333.csv.xz
This data file has amino acid compositions of 4392 proteins in the UniProt reference proteome of Escherichia coli K12 (https://www.uniprot.org/proteomes/UP000000625; last modified 2021-03-07, file timestamp 2021-06-16, accessed on 2021-07-12).
The data frame was created by running canprot::read_fasta("UP000000625_83333.fasta.gz").
UP000000803_7227.csv.xz
This data file has amino acid compositions of 4392 proteins in the UniProt reference proteome of Drosophila melanogaster (https://www.uniprot.org/proteomes/UP000000803; last modified 2021-03-07, file timestamp 2021-06-16, accessed on 2021-07-12).
Used in evdevH2O.
UP000001570_224308.csv.xz
This data file has amino acid compositions of 4392 proteins in the UniProt reference proteome of Bacillus subtilis strain 168 (https://www.uniprot.org/proteomes/UP000001570; last modified 2021-03-09, file timestamp 2021-06-16, accessed on 2021-07-12).
Used in evdevH2O.
yeastgfp.csv.xz
This file has 28 columns; the names of the first five are yORF, gene name, GFP tagged?, GFP visualized?, and abundance.
The remaining columns correspond to the 23 subcellular localizations considered in the YeastGFP project (Huh et al., 2003 and Ghaemmaghami et al., 2003) and hold values of either T or F for each protein.
‘yeastgfp.csv’ was downloaded on 2007-02-01 from http://yeastgfp.ucsf.edu using the Advanced Search, setting options to download the entire dataset and to include localization table and abundance, sorted by orf number.
Used in yeastgfp.
This directory also has subcellular location data for yeast; see yeast for more information.
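The sketch below shows one way a table with this layout could be queried in base R. The three rows, the values in the GFP columns, and the two localization columns shown (cytoplasm and nucleus) are invented for illustration; the real file has 23 localization columns.

```r
# Invented miniature table with the five leading columns and two localizations
csv_text <- "yORF,gene name,GFP tagged?,GFP visualized?,abundance,cytoplasm,nucleus
YXX001W,AAA1,tagged,visualized,1200,T,F
YXX002C,BBB2,tagged,visualized,350,F,T
YXX003W,CCC3,tagged,not visualized,NA,F,F"

# check.names = FALSE keeps column names such as "gene name" and "GFP tagged?"
gfp <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)

# Columns holding only T and F are converted to logical values on reading
str(gfp[, c("cytoplasm", "nucleus")])

# ORF names of proteins localized to the nucleus
nuclear <- gfp$yORF[gfp$nucleus]
print(nuclear)

# Mean abundance of proteins with a measured abundance
print(mean(gfp$abundance, na.rm = TRUE))
```

Because the T and F values become logical vectors, each localization column can be used directly as a row filter.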
vignettes
This directory has vignettes for differential expression data: TCGA.Rmd, HPA.Rmd, and osmotic_gene.Rmd.
The CSV files generated by these vignettes are also kept here; they are used for plots in canH2O.
OBIGT
The OldAA.csv file has thermodynamic data for glycine and methionine, [Gly] and [Met] sidechain groups, and the protein backbone group from Dick et al. (2006).
These data have been superseded in the default OBIGT database in CHNOSZ and are kept here in order to reproduce calculations from some papers (aaaq, scsc, and bison).
RefDB/RefSeq
‘genome_AA.csv.xz’ has amino acid compositions of species-level archaeal, bacterial, and viral taxa in the RefSeq database, and ‘taxonomy.csv.xz’ has taxonomic names for each of those species.
The scripts to produce these files are in the extdata/RefSeq directory of chem16S (see chem16S-package).
‘taxon_metrics.R’ and ‘taxon_metrics.csv.xz’ are the script and its output for selected chemical metrics (ZC and nH2O) of reference proteomes for taxa at genus and higher ranks.
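For orientation, ZC denotes the average oxidation state of carbon (Dick, 2014). A minimal sketch of its calculation from an elemental formula is given below, using the usual definition ZC = (3N + 2O + 2S - H + Z) / C, where the letters are the numbers of atoms of each element and Z is the net charge. The function name ZC_formula is hypothetical and is not part of the package; nH2O depends on a choice of basis species and is not covered here.

```r
# Hypothetical helper: average oxidation state of carbon from elemental counts
# ZC = (3N + 2O + 2S - H + Z) / C
ZC_formula <- function(C, H, N = 0, O = 0, S = 0, Z = 0) {
  (3 * N + 2 * O + 2 * S - H + Z) / C
}

ZC_formula(C = 2, H = 5, N = 1, O = 2)  # glycine, C2H5NO2: 1
ZC_formula(C = 1, H = 4)                # methane, CH4: -4
ZC_formula(C = 1, H = 0, O = 2)         # carbon dioxide, CO2: 4
```

For a protein, the same formula can be applied to the elemental composition summed over its amino acid residues.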
RefDB/GTDB
‘genome_AA.csv.xz’ has amino acid compositions of predicted proteins from GTDB, and ‘taxonomy.csv.xz’ has taxonomic names for each of those species.
The scripts to produce these files are provided in chem16S (see chem16S-package).
RefDB/UHGG
‘MGnify_genomes.csv’ lists all 4744 species-level clusters in the Unified Human Gastrointestinal Genome (UHGG v.2.0.1) from MGnify, obtained from https://www.ebi.ac.uk/metagenomics/genome-catalogues/human-gut-v2-0-1 on 2023-12-29.
‘getMGnify.R’ has the commands used to download FASTA files for proteins and to scrape the website for taxonomic information.
‘taxonomy.csv.xz’ has the taxonomy for 2350 selected genomes with contamination < 2
‘genome_AA.R’ calculates amino acid compositions of the selected genomes from FASTA files and writes the output file ‘genome_AA.csv.xz’.
‘taxon_AA.R’ combines amino acid compositions of genomes to generate reference proteomes for genera and higher taxonomic levels and writes the output file ‘taxon_AA.csv.xz’.
‘fullset’ has versions of ‘taxonomy.csv.xz’, ‘genome_AA.csv.xz’, and ‘taxon_AA.csv.xz’ for the full set of 4744 genomes.
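The idea of combining genome-level amino acid compositions into a reference proteome for a higher-level taxon can be sketched in base R as follows. This is not the code in ‘taxon_AA.R’: the column names, the genus labels, and the counts are invented, only two amino acid columns are shown, and summing (rather than averaging) the genome compositions is an assumption.

```r
# Invented genome-level table: one row per genome, with its genus assignment
genome_AA <- data.frame(
  genome = c("genome1", "genome2", "genome3"),
  genus  = c("GenusA", "GenusA", "GenusB"),
  ALA    = c(10, 30, 5),
  GLY    = c(20, 40, 15),
  stringsAsFactors = FALSE
)

# Sum the amino acid counts of all genomes that belong to the same genus
taxon_AA <- rowsum(genome_AA[, c("ALA", "GLY")], group = genome_AA$genus)
print(taxon_AA)
```

The result has one row per genus, named by the grouping value; repeating the grouping with family, order, and so on would give the higher taxonomic levels.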
Dick JM, LaRowe DE and Helgeson HC (2006) Temperature, pressure, and electrochemical constraints on protein speciation: Group additivity calculation of the standard molal thermodynamic properties of ionized unfolded proteins. Biogeosciences 3, 311–336. doi:10.5194/bg-3-311-2006
Dick JM (2009) Calculation of the relative metastabilities of proteins in subcellular compartments of Saccharomyces cerevisiae. BMC Syst. Biol. 3, 75. doi:10.1186/1752-0509-3-75
Dick JM and Shock EL (2011) Calculation of the relative chemical stabilities of proteins as a function of temperature and redox chemistry in a hot spring. PLOS One 6, e22782. doi:10.1371/journal.pone.0022782
Dick JM and Shock EL (2013) A metastable equilibrium model for the relative abundance of microbial phyla in a hot spring. PLOS One 8, e72395. doi:10.1371/journal.pone.0072395
Dick JM (2014) Average oxidation state of carbon in proteins. J. R. Soc. Interface 11, 20131095. doi:10.1098/rsif.2013.1095
Dick JM (2016) Proteomic indicators of oxidation and hydration state in colorectal cancer. PeerJ 4, e2238. doi:10.7717/peerj.2238
Dick JM (2017) Chemical composition and the potential for proteomic transformation in cancer, hypoxia, and hyperosmotic stress. PeerJ 5, e3421. doi:10.7717/peerj.3421
Dick JM, Yu M, Tan J and Lu A (2019) Changes in carbon oxidation state of metagenomes along geochemical redox gradients. Front. Microbiol. 10, 120. doi:10.3389/fmicb.2019.00120
Dick JM (2019) CHNOSZ: Thermodynamic calculations and diagrams for geochemistry. Front. Earth Sci. 7, 180. doi:10.3389/feart.2019.00180
Dick JM, Yu M and Tan J (2020) Uncovering chemical signatures of salinity gradients through compositional analysis of protein sequences. Biogeosciences 17, 6145–6162. doi:10.5194/bg-17-6145-2020
Dick JM (2021) Water as a reactant in the differential expression of proteins in cancer. Comp. Sys. Onco. 1, e1007. doi:10.1002/cso2.1007
Dick JM and Shock EL (2021) The release of energy during protein synthesis at ultramafic-hosted submarine hydrothermal ecosystems. J. Geophys. Res.: Biogeosciences 126, e2021JG006436. doi:10.1029/2021JG006436
Dick JM (2022) A thermodynamic model for water activity and redox potential in evolution and development. J. Mol. Evol. 90, 182–199. doi:10.1007/s00239-022-10051-7
Dick JM, Boyer GM, Canovas PA III and Shock EL (2023) Using thermodynamics to obtain geochemical information from genomes. Geobiology 21, 262–273. doi:10.1111/gbi.12532
Dick JM and Tan J (2023) Chemical links between redox conditions and estimated community proteomes from 16S rRNA and reference protein sequences. Microb. Ecol. 85, 1338–1355. doi:10.1007/s00248-022-01988-9
Dick JM and Meng D (2023) Community- and genome-based evidence for a shaping influence of redox potential on bacterial protein evolution. mSystems 8, e00014-23. doi:10.1128/msystems.00014-23
Dick JM and Kang X (2023) chem16S: community-level chemical metrics for exploring genomic adaptation to environments. Bioinformatics 39, btad564. doi:10.1093/bioinformatics/btad564