Apoderoides is an R package for finding and deleting erroneous taxa from a phylogenetic tree by calculating a score for each taxon. The score shows how strongly a taxon disrupts the monophyly of ranks, so it can be used to prioritize which taxon should be deleted first. Apoderoides focuses especially on erroneous taxa caused by taxon mistakes or misidentification.
You can install Apoderoides with the following commands in the R console. They require an Internet connection.
First, load the package with the following command:
library(Apoderoides)
We need a phylogenetic tree for the analysis. The next command loads a test tree included in this package.
Otherwise, load your own phylogenetic tree with a command like this:
Let’s calculate the scores of the taxa in the loaded tree at the genus level. Taxa with higher scores are more harmful to the monophyly of genera, and such taxa are considered erroneous. The scores are calculated by the next command:
The calc.Score() function returns a list of size two, containing two sets of scores based on the centroid and on the most recent common ancestor (MRCA) of the ranks being checked. In the code above, we are checking the genus rank. Here, let’s look at the top 10 scores of the test tree based on the centroid.
calc.Score(testTree,show_progress=FALSE)[[1]][1:10,]
#> OTU perCladeOTUScore sum intruder outlier
#> [1,] "Araucaria_cunninghamii" "248" "248" "0" "248"
#> [2,] "Thuja_orientalis" "22" "22" "2" "20"
#> [3,] "Callitropsis_vietnamensis" "10" "10" "3" "7"
#> [4,] "Callitropsis_funebris" "8" "8" "2" "6"
#> [5,] "Goniophlebium_niponicum" "6" "6" "1" "5"
#> [6,] "Dryopteris_crassirhizoma" "5" "5" "1" "4"
#> [7,] "Cyclosorus_dentatus" "5" "5" "2" "3"
#> [8,] "Juniperus_chinensis" "5" "5" "3" "2"
#> [9,] "Cupressus_duclouxiana" "4" "4" "4" "0"
#> [10,] "Pteris_vittata" "3" "3" "1" "2"
#> #clade
#> [1,] "136"
#> [2,] "121"
#> [3,] "124"
#> [4,] "123"
#> [5,] "65"
#> [6,] "49"
#> [7,] "87"
#> [8,] "125"
#> [9,] "126"
#> [10,] "41"
The columns of the calculation results show the following information:
“OTU”: The names of the tree tips.
“perCladeOTUScore”: The final score of the tree tip, calculated as “sum” divided by the number of taxa with the same “#clade”.
“sum”: The sum of “intruder” and “outlier”.
“intruder”: The intruder score of the tree tip.
“outlier”: The outlier score of the tree tip.
“#clade”: Identifier of clades of the same rank (here, genus). Different clades have different “#clade” values.
In short, the intruder score shows how many clades of other ranks the tree tip intrudes into, and the outlier score shows how far the tree tip is from the main clade of the rank it belongs to.
The result shows that “Araucaria_cunninghamii” is by far the top candidate to delete from the tree due to its high score.
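The values in the printed output are quoted, which suggests that the scores are returned as a character matrix. If that is the case, converting them to numbers makes filtering and comparison easier. The following base R sketch works on a few rows copied by hand from the output above; the object names score_matrix and score_df are illustrative and not part of Apoderoides.

```r
# A few rows copied from the printed output above, as a character matrix.
score_matrix <- matrix(
  c("Araucaria_cunninghamii", "248", "248", "0", "248", "136",
    "Thuja_orientalis",       "22",  "22",  "2", "20",  "121",
    "Cupressus_duclouxiana",  "4",   "4",   "4", "0",   "126"),
  ncol = 6, byrow = TRUE,
  dimnames = list(NULL, c("OTU", "perCladeOTUScore", "sum",
                          "intruder", "outlier", "#clade"))
)

# Convert to a data frame and make the score columns numeric.
score_df <- as.data.frame(score_matrix, stringsAsFactors = FALSE)
num_cols <- c("perCladeOTUScore", "sum", "intruder", "outlier")
score_df[num_cols] <- lapply(score_df[num_cols], as.numeric)

# Tips whose score comes mainly from the outlier component.
mostly_outlier <- score_df$OTU[score_df$outlier > score_df$intruder]
print(mostly_outlier)
```

With numeric columns, the intruder and outlier components can be compared directly, for example to separate tips that sit far from their genus from tips that are nested inside another genus.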
Please note that this function assumes that all tree tip names are scientific names joined by underscores, like “Homo_sapiens”. If the tree tips are named otherwise, please see the next chapter.
When you want to calculate scores for a rank other than genus, or when the tree tips are not scientific names, you need a list of the ranks to which the tree tips belong. Let’s look at the rank list of the test tree, loaded by the following command:
The contents of the rank list look like this:
data("testRankList")
testRankList[[1]][1:10]
#> [1] "Wollemia_nobilis"
#> [2] "Araucaria_bidwillii"
#> [3] "Araucaria_araucana"
#> [4] "Araucaria_angustifolia"
#> [5] "Ephedra_sinica"
#> [6] "Ephedra_przewalskii"
#> [7] "Ephedra_monosperma"
#> [8] "Cathaya_argyrophylla"
#> [9] "Pseudotsuga_sinensis_var._wilsoniana"
#> [10] "Larix_gmelinii"
testRankList[[2]][1:10]
#> [1] "Araucariaceae" "Araucariaceae" "Araucariaceae" "Araucariaceae"
#> [5] "Ephedraceae" "Ephedraceae" "Ephedraceae" "Pinaceae"
#> [9] "Pinaceae" "Pinaceae"
The rank list is a list of size 2. The first element is a character vector of the tree tip names (obtained by, e.g., testTree$tip.label). The second element is a character vector of the rank names corresponding to the first element of the rank list. In this test data, the rank list gives the family of each test tree tip. When the tree tips are not scientific names and you want to calculate the score for genus, you can do so by setting the genus names in the second element of the rank list.
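As a minimal sketch of the structure described above, a rank list for tips that are not scientific names could be assembled in base R as follows. The tip labels and the lookup table are hypothetical example data, and the step of passing the list to calc.Score() is left out here.

```r
# Hypothetical tip labels that are sample IDs rather than scientific names.
tip_labels <- c("sample_01", "sample_02", "sample_03", "sample_04")

# Hypothetical lookup table from tip label to genus (assumed to be known).
genus_lookup <- c(
  sample_01 = "Araucaria",
  sample_02 = "Araucaria",
  sample_03 = "Ephedra",
  sample_04 = "Larix"
)

# First element: tip labels. Second element: the rank of each tip,
# in the same order as the first element.
rank_list <- list(tip_labels, unname(genus_lookup[tip_labels]))

# Basic sanity checks: same length, and no tip without a rank.
stopifnot(length(rank_list[[1]]) == length(rank_list[[2]]))
stopifnot(!anyNA(rank_list[[2]]))
print(rank_list)
```

Indexing the lookup table by tip_labels keeps the two elements aligned even if the table is stored in a different order.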
Using this rank list, the score of test tree for family can be calculated by the following command:
The output can be interpreted just like the scores for genus. The only difference is whether the score is based on the monophyly of genera or of families.
The score tells us which tree tip(s) are most erroneous in the tree. Therefore, repeating the score calculation and deleting the top-scoring tip(s) until all tips score 0 yields a tree without erroneous tips while deleting only a small number of tips. The following commands perform such automatic deletion of erroneous taxa for the test tree:
The output of autoDeletion() is a list of size 3. The first element is the tree without erroneous tips. The second element is a character vector of deleted tree tips. The third element is a list of scores repeatedly calculated until all the erroneous tips are deleted.
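The repeat-and-delete idea can be sketched in base R. This is not the autoDeletion() implementation. The tree is reduced to a vector of tip names in a fixed order, toy_score() is a made-up stand-in for the real score, and greedy_delete() is a hypothetical helper. Only the loop structure and the three-part result are meant to mirror the description above.

```r
# Toy stand-in for a scoring function (NOT the Apoderoides score):
# a tip scores 1 when its genus has other members but none of them is
# adjacent to it in the tip order, and 0 otherwise.
toy_score <- function(tips) {
  genus <- sub("_.*$", "", tips)
  n <- length(tips)
  vapply(seq_len(n), function(i) {
    neighbours <- genus[setdiff(c(i - 1, i + 1), c(0, n + 1))]
    has_relatives <- sum(genus == genus[i]) > 1
    as.numeric(has_relatives && !(genus[i] %in% neighbours))
  }, numeric(1))
}

# Repeatedly score the tips and drop the top-scoring one
# until every remaining tip scores 0.
greedy_delete <- function(tips, score_fn) {
  deleted <- character(0)
  history <- list()
  repeat {
    scores <- score_fn(tips)
    history[[length(history) + 1]] <- setNames(scores, tips)
    if (length(tips) == 0 || all(scores == 0)) break
    worst <- which.max(scores)
    deleted <- c(deleted, tips[worst])
    tips <- tips[-worst]
  }
  list(kept = tips, deleted = deleted, history = history)
}

# Hypothetical tip order in which Bus_x sits inside the Aus tips.
tips <- c("Aus_a", "Aus_b", "Bus_x", "Aus_c", "Aus_d", "Bus_y", "Bus_z")
result <- greedy_delete(tips, toy_score)
print(result$deleted)
```

Recalculating the scores after every deletion matters because removing one tip can change the scores of the others, which is why the score history has one entry per round.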
The functions calc.Score() and autoDeletion() have the arguments show_progress and num_threads. show_progress is a boolean (TRUE or FALSE) and is TRUE by default. When it is TRUE, the progress of the calculation is reported on the R console. When it is FALSE, no progress is reported, but the calculation is slightly faster. num_threads is a positive integer and is 1 by default. You can use this argument to specify the number of threads for faster calculation. However, this option takes effect only when OpenMP is available, and the default compiler on macOS does not support OpenMP. Single-thread calculation is still available on macOS, but if you want to use multiple threads on macOS, you need to install OpenMP. One way to install OpenMP is to run the following commands in the terminal:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install libomp
The function calc.Score() also has a sort argument. sort is a boolean and is TRUE by default. When it is FALSE, the resulting scores are not sorted in descending order and instead remain in the original order of the tree tips.
The function get.upperRank() returns the genus names of the given scientific names, assuming that they are joined by underscores. This may be useful for looking up upper ranks when making a rank list.
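For illustration, the same naming assumption can be applied in base R by taking everything before the first underscore as the genus. This is a sketch of the idea only, not the code of get.upperRank(), and the variable names are illustrative. The tip names are taken from the rank list shown earlier.

```r
# Tip names joined by underscores, taken from the test rank list above.
tip_labels <- c("Wollemia_nobilis",
                "Araucaria_bidwillii",
                "Araucaria_araucana",
                "Pseudotsuga_sinensis_var._wilsoniana")

# Keep only the part before the first underscore as the genus name.
genus <- sub("_.*$", "", tip_labels)

# Count how many tips fall into each genus.
tips_per_genus <- table(genus)
print(genus)
print(tips_per_genus)
```

Names with more than two parts, such as varieties, still yield the genus because only the first underscore is used as the split point.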