R Scripts
Here you will find several R scripts that I wrote to automate certain tedious tasks that popped up when I was working on my dissertation. You are welcome to use them if they might be helpful to you, though be forewarned that I am not a serious R programmer so it is likely that there are much more elegant solutions to these problems. Also, by way of disclaimer, I can not make any guarantees regarding the functionality of the scripts and am not responsible for any damages incurred from their use. These are free for non-commercial use only. If you do use them, you should keep an eye on the output to make sure that they are doing what you want them to do. Lastly, if you find them to be useful or if you discover an error, please drop me an email to let me know. And if you use them for an analysis that is published, the favor of an acknowledgment would be appreciated!
1. Suppose that you have reason to believe that there might be a barrier subdividing a set of populations, and you would like to know if the distribution of widespread alleles (i.e., alleles shared by more than one population) at a given genetic marker reflects that barrier. In other words, if the barrier has limited past dispersal, we might expect widespread alleles to be shared more often among populations on one side of the barrier or the other than by populations on both sides. The script alleleRandomizer runs a simple randomization procedure that determines whether or not the observed number of widespread alleles distributed across the putative subdivision could be explained by chance alone.
2. Parametric bootstrapping (sometimes called the SOWH test) is a powerful means of testing alternative phylogenetic hypotheses, but it is computationally intensive and there seems to be some disagreement (or at least confusion) over how best to apply it. In my own work I've done my best to follow the method outlined by Nick Goldman and colleagues (e.g., see here), which to the best of my understanding boil down to 1) choose the best nucleotide substitution model (M1) for your data under the phylogenetic topology (H1) to be tested (lots of people use MODELTEST for this; I like DT-MODSEL), 2) using M1, find the best maximum likelihood (ML) tree for your empirical data while simultaneously estimating parameter values, 3) using M1, find the best ML tree constrained to H1 while simultaneously estimating parameter values, 4) generate lots (>=100) of simulated datasets on the ML tree using M1 and the parameter values estimated under the topology constraint (in step 3) , 5) conduct 2 ML tree searches (under M1, but estimating new parameter values each time) on each dataset - one constrained for the H1 topology and one unconstrained, 6) take the difference in log-likelihood scores from each set of searches and determine the proportion of times the simulated score differences were greater than or equal to the observed difference (scores taken from steps 2 and 3). If the simulated differences exceed the observed difference less than 5% of the time, then you can reject the H1 topology at alpha=0.05.
The main trouble with parametric bootstrapping is that it takes forever to do all those ML runs using tradition methods (i.e., PAUP). However, faster ML methods for estimating phylogenies are being developed. One of these is GARLI, an excellent program written by Derrick Zwickl. I became interested in using GARLI for parametric bootstrapping because it runs orders of magnitude faster than PAUP, estimates parameter values for substitution models on the fly, and can be set up to run under batch mode. Here I provide the scripts that I used to set up analyses and analyze the results. I'll assume that you have your own preferred method for identifying an appropriate model of nucleotide substitution. Once I've chosen one (generally by iterating through multiple runs of GARLI and DT-MODSEL until the model stabilizes), I do a final GARLI run with several replicates to identify a good constrained topology and set of parameter values, then I use these in MESQUITE to generate the simulated datasets (of equal length the the empirical data). I generate simple nexus formatted datasets. Then I use the script garli.BatchMaker to create the necessary infiles (two for each data matrix) and batch file to run the analyses in GARLI.
Note that you can choose to set up constrained, unconstrained, or both constrained and unconstrained batch runs using this, so it is useful for any batch analysis that you wish to perform in GARLI. Also note that this script allows you to use different GARLI executables, including the beta version of the software that can handle partitioned analyses (e.g., site-specific rates for different codon positions). I used the site-specific codon rates model for some analyses, and because I haven't figured out how to tell MESQUITE to add the necessary set definitions to the end of each simulated dataset, I wrote nexusCodonSetAppend to do it for me.
Once the analyses are run, you end up with a lot of trees to sort through. I wrote garliBestScoreExtractor to collect the best scores from each GARLI analysis, calculate the differences between the constrained and unconstrained topologies, and compare them to the empirical difference (if provided).
3. I have yet to discover the best way to compare non-nested substitution models, but one approach that I have seen used is similar to the parametric bootstrapping method outlined above. Datasets are simulated under a null model. Analyses are run on the datasets under the null model and an alternative model, and differences in likelihood scores are calculated. The distribution of differences is compared to the difference in scores for the empirical data. You can use the above scripts to set up the appropriate batch analysis in GARLI, but to calculate likelihood scores under the null and alternative models (and their differences) I wrote garliLscorebat, which creates a file that can be executed in PAUP to perform the score calculations.
The resulting score file can then be parsed in paupScoreHistMaker to compare the simulated results to your empirical result.
That's all for now. If I come up with any more I will continue to post them here. Thanks for your interest!