PhyloProfileData 1.0.0
The PhyloProfileData package contains two experimental datasets that illustrate how to run and analyse phylogenetic profiles with the PhyloProfile package (Tran et al. 2018).
library(ExperimentHub)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect,
## is.unsorted, lapply, mapply, match, mget, order, paste, pmax,
## pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply,
## setdiff, sort, table, tapply, union, unique, unsplit, which,
## which.max, which.min
## Loading required package: AnnotationHub
## Loading required package: BiocFileCache
## Loading required package: dbplyr
eh <- ExperimentHub()
## snapshotDate(): 2019-10-22
myData <- query(eh, "PhyloProfileData")
myData
## ExperimentHub with 6 records
## # snapshotDate(): 2019-10-22
## # $dataprovider: Applied Bioinformatics Dept., Goethe University Frankfurt
## # $species: NA
## # $rdataclass: data.frame, AAStringSet
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH2544"]]'
##
## title
## EH2544 | Phylogenetic profiles of human AMPK-TOR pathway
## EH2545 | FASTA sequences for proteins in the phylogenetic profiles of ...
## EH2546 | Domain annotations for proteins in the phylogenetic profiles ...
## EH2547 | Phylogenetic profiles of BUSCO arthropoda proteins
## EH2548 | FASTA sequences for proteins in the phylogenetic profiles of ...
## EH2549 | Domain annotations for proteins in the phylogenetic profiles ...
The phylogenetic profiles of 147 human proteins in the AMPK-TOR pathway across 83 species in the three domains of life were taken from the study of Roustan et al. 2016.
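This dataset includes three files, namely the phylogenetic profiles (EH2544), the FASTA sequences (EH2545), and the domain annotations (EH2546):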
ampkTorPhyloProfile <- myData[["EH2544"]]
head(ampkTorPhyloProfile)
## geneID ncbiID orthoID FAS_F
## 1 ampk_ACACA ncbi284812 ampk_ACACA|SCHPO@284812@1|P78820|1 0.9884601
## 2 ampk_ACACA ncbi665079 ampk_ACACA|SCLS1@665079@1|A7EM01|1 0.9905497
## 3 ampk_ACACA ncbi35128 ampk_ACACA|THAPS@35128@1|B5YMF5|0 0.9058650
## 4 ampk_ACACA ncbi35128 ampk_ACACA|THAPS@35128@1|B8BVD1|1 0.9794378
## 5 ampk_ACACA ncbi7070 ampk_ACACA|TRICA@7070@1|D2A5X8|1 0.9813494
## 6 ampk_ACACA ncbi237631 ampk_ACACA|USTMA@237631@1|A0A0D1DYD5|1 0.9770244
## FAS_B
## 1 0.9907436
## 2 0.9906191
## 3 0.8169658
## 4 0.9359992
## 5 0.9843459
## 6 0.9456425
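The orthoID column packs several pieces of information into one string separated by "|". As a minimal sketch, the six rows printed above can be rebuilt as a toy data frame and split into their parts with base R. The column names taxonTag, proteinID and taxID are hypothetical names chosen for this illustration, and the reading of the second and third fields as a taxon tag and a protein accession is an assumption based on the pattern visible in the printed rows, not a documented specification.

```r
# Toy data frame copied from the six rows printed above
# (in practice, use ampkTorPhyloProfile directly)
profile <- data.frame(
    geneID = rep("ampk_ACACA", 6),
    ncbiID = c("ncbi284812", "ncbi665079", "ncbi35128",
               "ncbi35128", "ncbi7070", "ncbi237631"),
    orthoID = c(
        "ampk_ACACA|SCHPO@284812@1|P78820|1",
        "ampk_ACACA|SCLS1@665079@1|A7EM01|1",
        "ampk_ACACA|THAPS@35128@1|B5YMF5|0",
        "ampk_ACACA|THAPS@35128@1|B8BVD1|1",
        "ampk_ACACA|TRICA@7070@1|D2A5X8|1",
        "ampk_ACACA|USTMA@237631@1|A0A0D1DYD5|1"
    ),
    FAS_F = c(0.9884601, 0.9905497, 0.9058650,
              0.9794378, 0.9813494, 0.9770244),
    FAS_B = c(0.9907436, 0.9906191, 0.8169658,
              0.9359992, 0.9843459, 0.9456425),
    stringsAsFactors = FALSE
)

# Split every orthoID on the literal "|" separator
parts <- strsplit(profile$orthoID, "|", fixed = TRUE)

# Assumption: field 2 is a taxon tag, field 3 a protein accession
profile$taxonTag <- vapply(parts, `[`, character(1), 2)
profile$proteinID <- vapply(parts, `[`, character(1), 3)

# Numeric taxonomy ID: ncbiID without its "ncbi" prefix
profile$taxID <- as.integer(sub("^ncbi", "", profile$ncbiID))

# Number of ortholog entries per taxon
orthologsPerTaxon <- table(profile$ncbiID)

# Mean of both FAS columns per taxon
meanFas <- aggregate(cbind(FAS_F, FAS_B) ~ ncbiID, data = profile, FUN = mean)
```

In this toy subset the taxon ncbi35128 contributes two entries, so its row in meanFas averages two scores, whereas every other taxon has a single entry.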
ampkTorFasta <- myData[["EH2545"]]
head(ampkTorFasta)
## A AAStringSet instance of length 6
## width seq names
## [1] 297 VGYPVMLKASWGGGGKGIRKVS...RDCVTVRGEIRTTTDYVLDLL ampk_ACACA|CHLRE@...
## [2] 2156 MLRTVKEYVAAYEGKRVIKRLL...TLLTYLDRQRIVRRGWFCFDS ampk_ACACA|MONBE@...
## [3] 2326 MPGHSTTGAAGETTPDTQDMVA...LSDKDREEAVAALRRGSIFHK ampk_ACACA|PHYRM@...
## [4] 2282 MIEINEYIKKLGGDKNIEKILI...PFISTQQKEFLFESLKKDLNK ampk_ACACA|DICDI@...
## [5] 2168 MKAMQETSSPVGFRYDSMEQLC...AIAKAAKVALDSSACAHSTAE ampk_ACACA|LEIMA@...
## [6] 3367 MINFFLSLLLFVLFFENLVVSI...FKMLSQEQRTEFLNKINSYEN ampk_ACACA|PLAF7@...
ampkTorDomain <- myData[["EH2546"]]
head(ampkTorDomain)
## seedID
## 1 ampk_ACACA#ampk_ACACA|ANOGA@7165@1|Q7PQ11|1
## 2 ampk_ACACA#ampk_ACACA|ANOGA@7165@1|Q7PQ11|1
## 3 ampk_ACACA#ampk_ACACA|ANOGA@7165@1|Q7PQ11|1
## 4 ampk_ACACA#ampk_ACACA|ANOGA@7165@1|Q7PQ11|1
## 5 ampk_ACACA#ampk_ACACA|ANOGA@7165@1|Q7PQ11|1
## 6 ampk_ACACA#ampk_ACACA|ANOGA@7165@1|Q7PQ11|1
## orthoID feature start end
## 1 ampk_ACACA|ANOGA@7165@1|Q7PQ11|1 pfam_CPSase_L_D2 260 462
## 2 ampk_ACACA|ANOGA@7165@1|Q7PQ11|1 pfam_ATPgrasp_Ter 121 352
## 3 ampk_ACACA|ANOGA@7165@1|Q7PQ11|1 pfam_ATPgrasp_Ter 367 469
## 4 ampk_ACACA|ANOGA@7165@1|Q7PQ11|1 pfam_Carboxyl_trans 1644 2198
## 5 ampk_ACACA|ANOGA@7165@1|Q7PQ11|1 smart_Biotin_carb_C 496 603
## 6 ampk_ACACA|ANOGA@7165@1|Q7PQ11|1 pfam_Biotin_carb_N 111 231
A fundamental step in establishing phylogenetic profiles is the search for orthologs of the query proteins in the taxa of interest. HaMStR-oneseq, an extended version of HaMStR (Ebersberger et al. 2009), has been shown to be a promising approach for sensitively predicting orthologs even in taxa that are distantly related to the query species. This sensitivity is required for phylogenetic profiling across a broad range of taxa spanning all domains of the tree of life. A main parameter of HaMStR-oneseq is the core ortholog group, which serves as the starting point for the orthology search. To set up a reliable core ortholog set that can be used for further phylogenetic profiling studies, we made use of the well-known BUSCO datasets (Simão et al. 2015).
Here we present the phylogenetic profiles of 1011 ortholog groups across 88 species, which were calculated from the BUSCO arthropoda dataset downloaded from https://busco.ezlab.org/datasets/arthropoda_odb9.tar.gz in Jan. 2018. The 88 species comprise 10 arthropod species (Ladona fulva, Agrilus planipennis, Polypedilum vanderplanki, Daphnia magna, Harpegnathos saltator, Zootermopsis nevadensis, Halyomorpha halys, Heliconius melpomene, Stegodyphus mimosarum, Drosophila willistoni) downloaded from OrthoDB version 10 (https://www.orthodb.org) and 78 species from the Quest for Orthologs dataset (Altenhoff et al. 2016).
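This dataset includes three files, namely the phylogenetic profiles (EH2547), the FASTA sequences (EH2548), and the domain annotations (EH2549):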
arthropodaPhyloProfile <- myData[["EH2547"]]
head(arthropodaPhyloProfile)
## geneID ncbiID orthoID
## 1 97421at6656 ncbi9598 97421at6656|PANTR@9598@1|H2QTF9|1
## 2 97421at6656 ncbi321614 97421at6656|PHANO@321614@1|Q0U682|1
## 3 97421at6656 ncbi3218 97421at6656|PHYPA@3218@1|A9TGR3|1
## 4 97421at6656 ncbi319348 97421at6656|POLVAN@319348@0|319348_0:000e70|1
## 5 97421at6656 ncbi208964 97421at6656|PSEAE@208964@1|Q9HXF1|1
## 6 97421at6656 ncbi10116 97421at6656|RAT@10116@1|D3ZAT9|1
## FAS_F FAS_B
## 1 0.6872810 0.9654661
## 2 0.7087412 0.9798884
## 3 0.7544057 0.8727715
## 4 0.8062524 0.9610529
## 5 0.7979757 0.9498075
## 6 0.7340443 0.9492033
arthropodaFasta <- myData[["EH2548"]]
head(arthropodaFasta)
## A AAStringSet instance of length 6
## width seq names
## [1] 484 MATSGAFAGGSPGRGFAPRGRA...SKLHQQLLYVDRLMLQLRDYA 42842at6656|MONBE...
## [2] 535 MSTRKQYACDLACRLVQDQYGD...NEVMETSLAHLDQMIAVFNDF 42842at6656|CHLRE...
## [3] 607 MLCCLFGVQIKCALLKLLQHNV...LDRLDRAIIHLDGMLMLYRDF 42842at6656|PHYRM...
## [4] 487 MHFSGFKSVVLSCVEEYFDTTA...DQIDLIEPIYIKLVETAMLLF 42842at6656|GIAIC...
## [5] 666 MYEQKVAIDIVKESFGDDVTKV...TQTLLTVILNLDNDLLHLYSF 42842at6656|DICDI...
## [6] 579 MNKARGTEVAGFITDAAHIRAA...LDRLDFACLQLDETLMVLKDF 42842at6656|THAPS...
arthropodaDomain <- myData[["EH2549"]]
head(arthropodaDomain)
## seedID
## 1 136365at6656#136365at6656|AGRPL@224129@0|224129_0:000004|1
## 2 136365at6656#136365at6656|AGRPL@224129@0|224129_0:000004|1
## 3 136365at6656#136365at6656|ANOGA@7165@1|Q7QC64|1
## 4 136365at6656#136365at6656|ANOGA@7165@1|Q7QC64|1
## 5 136365at6656#136365at6656|ANOGA@7165@1|Q7QC64|1
## 6 136365at6656#136365at6656|AQUAE@224324@1|O67650|1
## orthoID length
## 1 136365at6656|AGRPL@224129@0|224129_0:000004|1 142
## 2 136365at6656|DROME@7227@1|Q86BM8 138
## 3 136365at6656|ANOGA@7165@1|Q7QC64|1 142
## 4 136365at6656|ANOGA@7165@1|Q7QC64|1 142
## 5 136365at6656|DROME@7227@1|Q86BM8 138
## 6 136365at6656|AQUAE@224324@1|O67650|1 98
## feature start end weight path
## 1 pfam_Ribosomal_L27 26 106 NA Y
## 2 pfam_Ribosomal_L27 22 104 1 Y
## 3 pfam_Ribosomal_L27 26 106 NA Y
## 4 seg_low complexity regions 37 46 NA Y
## 5 pfam_Ribosomal_L27 22 104 1 Y
## 6 pfam_Ribosomal_L27 2 80 NA Y
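In this table the seedID joins a group ID and a protein ID with "#", while the orthoID column names the protein that the row annotates. The sketch below rebuilds the six printed rows as a toy data frame, splits seedID at "#", and flags whether a row annotates the protein named in the seedID or the other protein of the pair. The labels "ortholog" and "seed" and the column names role and coverage are hypothetical and reflect an assumption drawn from the printed rows, not a documented definition.

```r
# Toy data frame copied from the six rows printed above
# (in practice, use arthropodaDomain directly)
agrpl <- "136365at6656|AGRPL@224129@0|224129_0:000004|1"
anoga <- "136365at6656|ANOGA@7165@1|Q7QC64|1"
aquae <- "136365at6656|AQUAE@224324@1|O67650|1"
drome <- "136365at6656|DROME@7227@1|Q86BM8"

arthDomains <- data.frame(
    seedID = paste0("136365at6656#",
                    c(agrpl, agrpl, anoga, anoga, anoga, aquae)),
    orthoID = c(agrpl, drome, anoga, anoga, drome, aquae),
    length = c(142L, 138L, 142L, 142L, 138L, 98L),
    feature = c("pfam_Ribosomal_L27", "pfam_Ribosomal_L27",
                "pfam_Ribosomal_L27", "seg_low complexity regions",
                "pfam_Ribosomal_L27", "pfam_Ribosomal_L27"),
    start = c(26L, 22L, 26L, 37L, 22L, 2L),
    end = c(106L, 104L, 106L, 46L, 104L, 80L),
    weight = c(NA, 1, NA, NA, 1, NA),
    path = rep("Y", 6),
    stringsAsFactors = FALSE
)

# Split seedID into the group ID and the protein ID after "#"
pair <- strsplit(arthDomains$seedID, "#", fixed = TRUE)
groupID <- vapply(pair, `[`, character(1), 1)
pairedProtein <- vapply(pair, `[`, character(1), 2)

# Assumption: rows whose orthoID matches the protein in seedID annotate
# the ortholog; the remaining rows annotate the other protein of the pair
arthDomains$role <- ifelse(arthDomains$orthoID == pairedProtein,
                           "ortholog", "seed")

# Fraction of the protein covered by each feature,
# assuming inclusive start and end positions
arthDomains$coverage <-
    (arthDomains$end - arthDomains$start + 1) / arthDomains$length
```

Separating the rows this way makes it possible to place the features of both proteins in a pair side by side, for example with split(arthDomains, arthDomains$role).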