Bioinformatics Resources
Data Skills
Vince Buffalo’s Bioinformatics Data Skills is a great book for building skills. Katie also has a set of great readings for learning; contact her for more info.
RAD seq
https://radcamp.github.io/NYC2019/RADCamp-PartII-Day1-AM.html
https://github.com/dereneaton/ipyrad/blob/master/newdocs/assembly_guidelines.rst
VCF files and population structure
[What is a VCF?](https://samtools.github.io/hts-specs/VCFv4.2.pdf)
Katie also has some lectures on VCF files.
Filter VCF table and produce a PCA plot
When getting started, it’s best to filter for relatedness and linkage disequilibrium, so that you have a quasi-independent set of individuals and SNPs.
Check for related individuals in the VCF using vcftools (--relatedness).
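Once vcftools has written its relatedness table, the pairs worth a closer look can be pulled out with a few lines of Python. This is only a sketch: it assumes the tab-separated output with the header INDV1, INDV2, RELATEDNESS_AJK, and the 0.25 cutoff and the function name `related_pairs` are illustrative choices, not lab standards.

```python
import csv
import io


def related_pairs(relatedness_text, threshold=0.25):
    """Return (indv1, indv2, ajk) for distinct pairs at or above threshold.

    relatedness_text: contents of a vcftools .relatedness file
    (tab-separated, header INDV1 / INDV2 / RELATEDNESS_AJK).
    threshold: hypothetical cutoff; pick one that suits your system.
    """
    reader = csv.DictReader(io.StringIO(relatedness_text), delimiter="\t")
    pairs = []
    for row in reader:
        if row["INDV1"] == row["INDV2"]:
            continue  # skip each individual compared with itself
        ajk = float(row["RELATEDNESS_AJK"])
        if ajk >= threshold:
            pairs.append((row["INDV1"], row["INDV2"], ajk))
    return pairs


# Made-up values, for illustration only.
example = (
    "INDV1\tINDV2\tRELATEDNESS_AJK\n"
    "ind1\tind1\t1.02\n"
    "ind1\tind2\t0.48\n"
    "ind1\tind3\t-0.03\n"
    "ind2\tind3\t0.01\n"
)
print(related_pairs(example))
```

From each flagged pair you would usually drop one individual before the PCA, for example the one with more missing data.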
Filter the VCF file with vcftools:
by minor allele frequency: remove SNPs with a minor allele frequency < 0.05 (if you have a lot of SNPs; use 0.01 if you don’t have a lot of SNPs)
by missing data at SNPs: keep SNPs called in at least X% of individuals across the whole dataset
by missing data at the population level: keep SNPs called in at least X% of individuals within each population
by individuals with a low SNP count: keep individuals with more than X% of SNPs called
Depending on the context, X% might be 80% or might be 100%.
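To make the thresholds above concrete, here is a small Python sketch of the same filters applied to a toy genotype matrix (rows are individuals, columns are SNPs, values are 0/1/2 alternate-allele counts, and None marks a missing call). It is not a replacement for vcftools; the function names, the toy data, and the 0.05 and 0.8 defaults are assumptions made for illustration.

```python
def snp_call_rate(genotypes, j):
    """Fraction of individuals with a non-missing call at SNP j."""
    called = sum(1 for row in genotypes if row[j] is not None)
    return called / len(genotypes)


def minor_allele_freq(genotypes, j):
    """Minor allele frequency at SNP j, ignoring missing calls."""
    calls = [row[j] for row in genotypes if row[j] is not None]
    if not calls:
        return 0.0
    alt_freq = sum(calls) / (2 * len(calls))  # two alleles per diploid call
    return min(alt_freq, 1 - alt_freq)


def filter_snps(genotypes, min_maf=0.05, min_call_rate=0.8):
    """Indices of SNPs passing the MAF and dataset-wide call-rate filters."""
    return [
        j for j in range(len(genotypes[0]))
        if minor_allele_freq(genotypes, j) >= min_maf
        and snp_call_rate(genotypes, j) >= min_call_rate
    ]


def snps_passing_per_population(genotypes, populations, min_call_rate=0.8):
    """Indices of SNPs whose call rate passes within every population."""
    keep = []
    for j in range(len(genotypes[0])):
        passes = True
        for pop in set(populations):
            rows = [g for g, p in zip(genotypes, populations) if p == pop]
            if snp_call_rate(rows, j) < min_call_rate:
                passes = False
                break
        if passes:
            keep.append(j)
    return keep


def filter_individuals(genotypes, snp_idx, min_call_rate=0.8):
    """Indices of individuals with enough calls across the kept SNPs."""
    keep = []
    for i, row in enumerate(genotypes):
        called = sum(1 for j in snp_idx if row[j] is not None)
        if snp_idx and called / len(snp_idx) >= min_call_rate:
            keep.append(i)
    return keep


# Toy data: 5 individuals x 4 SNPs, two populations.
genotypes = [
    [0, 1, 2, None],
    [0, 1, 1, None],
    [0, 0, 2, 1],
    [0, None, 0, 1],
    [1, 1, 0, None],
]
populations = ["A", "A", "A", "B", "B"]

kept_snps = filter_snps(genotypes)
kept_individuals = filter_individuals(genotypes, kept_snps)
```

Filtering SNPs first and individuals second is one common order; the order can change which individuals and SNPs survive, so it is worth recording which one you used.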
To use the bigsnpr R package, you will have to change your data from the VCF format to a genotype matrix with 0/1/2 counts of the alternate allele. Transform from vcf to (plink to) raw. See the first steps in this tutorial.
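To show what the 0/1/2 coding means, the sketch below converts VCF genotype fields into alternate-allele counts in plain Python. It is only an illustration of the encoding, not the VCF to plink to raw route described above. It assumes biallelic diploid SNPs and GT as the first FORMAT key (as the VCF specification requires when GT is present); the function names are hypothetical.

```python
import re


def gt_to_dosage(sample_field):
    """Convert one sample field (e.g. '0/1:12') to an alt-allele count.

    Returns None for missing calls and for anything that is not a
    biallelic diploid genotype.
    """
    gt = sample_field.split(":")[0]  # GT comes first when present
    alleles = re.split(r"[/|]", gt)  # unphased '/' or phased '|'
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return None
    return sum(int(a) for a in alleles)


def vcf_to_matrix(vcf_lines):
    """Return (sample_names, matrix) with individuals as rows, SNPs as columns."""
    samples = []
    snp_columns = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        snp_columns.append([gt_to_dosage(f) for f in fields[9:]])
    matrix = [list(row) for row in zip(*snp_columns)]  # transpose
    return samples, matrix


# Tiny made-up VCF, for illustration only.
vcf_text = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\tind3",
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\t1/1",
    "chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0|1:12\t./.:0\t1|1:9",
]
samples, matrix = vcf_to_matrix(vcf_text)
```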
The presence of LD can bias principal components. This paper explains the problem. Read about filtering for LD with pruning and clumping.
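As a rough picture of what pruning does, the sketch below walks along the SNPs in order and drops any SNP whose squared correlation with a recently kept SNP exceeds a cutoff. This is a deliberately naive version, not what bigsnpr implements. The `max_r2` and `window` defaults are arbitrary, the window counts kept SNPs rather than base pairs, and the function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two SNP dosage vectors.

    Individuals with a missing call (None) at either SNP are skipped.
    Returns 0.0 when the correlation is undefined (no variation or
    fewer than two shared calls).
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    return (sxy * sxy) / (sxx * syy)


def prune_ld(snp_columns, max_r2=0.2, window=50):
    """Greedy pruning: keep a SNP only if its r^2 with each of the last
    `window` kept SNPs is at most max_r2. Returns kept SNP indices."""
    kept = []
    for j, column in enumerate(snp_columns):
        recent = kept[-window:]
        if all(r_squared(column, snp_columns[k]) <= max_r2 for k in recent):
            kept.append(j)
    return kept


# Toy data: each inner list is one SNP across 6 individuals.
snps = [
    [0, 1, 2, 0, 1, 2],  # SNP 0
    [0, 1, 2, 0, 1, 2],  # SNP 1: identical to SNP 0
    [1, 0, 1, 1, 2, 1],  # SNP 2: uncorrelated with SNP 0
    [2, 1, 0, 2, 1, 0],  # SNP 3: mirror image of SNP 0
]
kept = prune_ld(snps)
```

Clumping differs from this in that it ranks SNPs by some statistic first and keeps the higher-ranked SNP of each correlated pair, instead of simply keeping whichever comes first along the chromosome.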
Filter the raw data using snp_autoSVD (it’s at the end of the tutorial). This tutorial also teaches you how to do a PCA.
Structure plot
For a Structure-style plot, use sNMF from the LEA package in R, with the VCF as input.
Run an RDA
Before you run an RDA and do some of the things in these tutorials, talk to Dr. L about your plan. Outlier SNP analysis with RDA has a lot of false positives, but the individual scores are very useful.
https://popgen.nescent.org/2018-03-27_RDA_GEA.html
https://github.com/laurabenestan/RDA_outlier
Katie also has some RDA code on simulations and oyster data.
Run a GWAS or GEA
Genotype-phenotype association: LFMM2. Read the LFMM2 and LEA3 papers.
LEA installation: check with Katie that this is correct before running.