I) Filter the data based on the genotype likelihood (GL).
Based on our discussion of last week, we decided to apply the following filters:
1: The maf = 0.05
2: Call a SNP is it’s p-value is below \({10^{-6}}\)
3: Discard reads with mapping quality below 15 (minMapQ)
4: Discard bases with base quality below (minQ)
5: Keep SNPs present in at least 70% of the individuals i.e. 95 * 0.7 \(\approx\) 67 individuals
6: Sites with coverage ranging from 2 to 30 => -setMinDepth 190 -setMaxDepth 2850
This script was applied to each scaffold as a separate job using the following command:
“for i in $SCAFFOLD_NAME; do sbatch ./GL_per_scaffolds.sh $i; done”
The scaffold names can be retrieved from the fasta reference genome file (ajConsensus.fasta) using the following command: “SCAFFOLD_NAME=$(cat ajConsensus.fasta | grep”>“| awk ‘{print $1}’| sed ‘s/>//g’)”
#! /bin/bash
#SBATCH --job-name=GL_EcoHP
#SBATCH --partition=short
#SBATCH --mem=8GB
#SBATCH --cpus-per-task=1
#SBATCH --time=0-00:30:00
# Script name: GL_per_scaffolds.sh
################
# Define paths #
################
BAM_LIST="~/list_bam.txt" # List of path to the 95 bam files
PATH_RESULTS="~/GFL/All/$1"
GFL_File="Wasp_Pina_LD_Prunning_$1.glf"
REF="~/ajConsensus.fasta"
ANGSD="/data/p274105/tool_genomics/angsd"
mkdir -p $PATH_RESULTS
$ANGSD/angsd -GL 2 \
-doGlf 2 \
-doMaf 1 \
-doMajorMinor 1 \
-doPost 1 -doGeno 8 \
-ref ${REF} \
-r $1 \
-bam ${BAM_LIST} \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 \
-trim 0 -C 50 -baq 1 \
-out ${mkdir -p $PATH_RESULTS}/${GFL_File} \
-nThreads 1 \
-minMapQ 15 \
-minQ 15 \
-SNP_pval 1e-6 \
-skipTriallelic 1 \
-minMaf 0.1 \
-minInd 67 \
-setMinDepth 190 -setMaxDepth 2850 -doCounts 1
=> This script generates 2 important type of files: “glf.geno.gz” and “glf.mafs.gz” for each scaffold.
Then I gathered the results of each scaffold in a single glf.mafs and glf.geno files using the commands below:
PATH_RESULTS="/data/p274105/Wasp_Pina/Data/GFL/All/
cd $PATH_RESULTS
(for i in 000000F; do cd $i; zcat *glf.mafs.gz | head -n1 ; cd ..; done ; \
for i in $(ls| grep -v -E "All|txt"); do cd $i; zcat *glf.mafs.gz | awk 'NR>1' ; cd ..; done)> All_Chr.glf.mafs
(for i in $(ls| grep -v -E "All|txt"); do cd $i; zcat *glf.geno.gz ; cd ..; done) > All_Chr.glf.geno
After filtering 130571 SNPs were kept.
IV) Phylogenetic relationships
The phylogenetic relationship were inferred using:
1: NgsDIST to calculate the distance matrix between the taxa and the bootstrap distances matrices.
2: FASTME to infer the resulting phylogenetic tree.
3: RAXML to replace the bootstrap support value on the main tree.
4: The tree was mid-point rooted using R package ape.
5: The tree was plotted using the R package ggtree.
#! /bin/bash
#SBATCH --job-name=ngsDistWasp
#SBATCH --partition=regular
#SBATCH --mem=6GB
#SBATCH --cpus-per-task=1
#SBATCH --time=2-22:30:00
# To calculate the distance matrix :
# https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md
# https://github.com/fgvieira/ngsDist
################
# Define paths #
################
PATH_RESULTS="~/Results/NJtree"
PATH_INPUTS="~/Results"
cat /data/p274105/Wasp_Pina/Data/New_pop.txt|cut -f 1 > $PATH_INPUTS/Labels.txt
FASTME="/data/p274105/tool_genomics/fastme-2.1.5/binaries/fastme-2.1.5-linux64"
##############
# NGSDIST
##############
# Programs needed
NGSDIST="/data/p274105/tool_genomics/ngsTools/ngsDist"
cd $PATH_INPUTS
GFL="Wasp_Pina_subset.glf.beagle.gz"
NSITES=`zcat $GFL| awk '{print $1}'| awk 'NR>1'|wc -l`
echo $NSITES
# 1000 boot, block 100 kb for whole genome SNPs (LD prunned)
module load GSL/2.5-GCC-8.2.0-2.31.1
$NGSDIST/ngsDist \
--n_threads 1 \
--verbose 1 \
--geno $PATH_INPUTS/$GFL \
--probs \
--n_ind 95 \
--pairwise_del \
--n_sites ${NSITES} \
--n_boot_rep 1000 \
--boot_block_size 5000 \
--labels $PATH_INPUTS/Labels.txt \
--out $PATH_RESULTS/output.dist
#--avg_nuc_dist \
echo "NGSDIST RAN"
##############
# FASTME
##############
$FASTME -i $PATH_RESULTS/output.dist \
-D 1001 \
-s \
-o $PATH_RESULTS/output.nwk
head -n 1 $PATH_RESULTS/output.nwk > $PATH_RESULTS/output.main.nwk
tail -n +2 $PATH_RESULTS/output.nwk | awk 'NF' > $PATH_RESULTS/output.boot.nwk
echo "FASTME ran"
#########
# RAXML #
#########
cd $PATH_RESULTS
module load RAxML/8.2.11-foss-2018a-mt-avx2
raxmlHPC -f b \
-t $PATH_RESULTS/output.main.nwk \
-z $PATH_RESULTS/output.boot.nwk \d
-m GTRCAT \
-T 1 \
-n output_final.tree
echo "RAXML ran"
The resulting phylogenetic tree was mid rooted and then plotted in R using the libraries ggtree and ape.
library(ggplot2)
library(ggtree)
library(ape)
library(phytools)
library(RColorBrewer)
setwd("/Users/yacinebenchehida/Desktop/Metadata/Results/NJtree/")
tree <- ggtree::read.tree("RAxML_bipartitions.output_final.tree") # There are different read.tree() functions that executes different commands. To avoid any ambiguity here I add ape:: before to say that I want to use the read.tree() fonction from the ape package. This tree is a maximum likelihood tree of the 6 species of the porpoises family (Phoconidae) reconstructed with PhyML.
#tree = ape::root(tree,node=109, resolve.root=TRUE)
#######################################
# Transform the tree to a ggtree tree #
#######################################
tree2 <- tree
tree2 <- midpoint.root(tree)
tree2 <- ggtree(tree2)
########################################
# Define a color code for each species #
########################################
data_color = cbind(tree2$data$label[1:95], rep("test")) # Combine the column with individual names with a new column that will contain the color associated to each individual in an objected called data_color
colnames(data_color) =c("species","color") # Give a name to the column of data_color
data_color = as.data.frame(data_color) # Transform data_color to a data frame
data_color[,2] = as.character(data_color[,2]) # Transform the color column into characters (insstead of factor)
pop = read.table("../../bam_location_correspondance.txt")
pop = data.frame(inds=pop$V1, locality = pop$V2)
pop$color = "NAN"
data_color$locality = "NAN"
species = c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai") # Create a vector with the unique species name
color = brewer.pal(10,"Paired") # Create a vector with the color associated to each unique species name
info = cbind(species,color) # Combine the unique species and the color in an object called info
# Loop to attribute to each individuals its color
for (i in 1:95){
for (j in 1:10) {
if(grepl(info[j,1],pop[i,2])){
pop[i,3] = info[j,2]
}
}
}
# Loop to attribute to each individuals its color
for (i in 1:95){
for (j in 1:95) {
if(pop[j,1]==data_color[i,1]){
data_color[i,2] = pop[j,3]
data_color[i,3] = pop[j,2]
}
}
}
############################
# Define bootstraps groups #
############################
tree2$data$col <- cut(as.numeric(tree2$data$label),
breaks = c(0, 90, 100),
labels = c("<=90", ">90"))
for (i in 1:189){
for (j in 1:95) {
if(data_color[j,1]==tree2$data[i,4]){
tree2$data[i,4] = paste(tree2$data[i,4], data_color[j,3], sep = " ")
}
}
}
tree2$data$label = gsub("(\\w+) (\\w+)", "\\2",tree2$data$label, perl = TRUE)
#############################################
# Plot full tree to have the final topology #
#############################################
p <- ggtree(tree2$data) + geom_tiplab(size =4, color= data_color$color) + geom_point2(aes(subset=(!isTip), fill= tree2$data$col, size = tree2$data$col, shape=tree2$data$col)) + theme(legend.position="right") + scale_fill_manual(breaks = c("<=90", ">90") , values = c("green","red")) +
scale_size_manual(breaks = c("<=90", ">90") , values = c(1.2,2.5)) + scale_shape_manual(breaks = c("<=90", ">90") , values = c(21,21)) + guides(size = FALSE, shape=FALSE, fill = guide_legend(override.aes=list(shape=21))) + labs(fill="Bootstraps")
p = p + geom_treescale(x=0.30) + theme(legend.title = element_text(color = "black", size = 16),legend.text = element_text(color = "black", size = 14)) + theme(legend.background = element_rect(size=1, color = "black", linetype="solid"))
p = p + ggplot2::xlim(0, 0.40) + theme(legend.position = c(0.1, 0.9))
p

=> The tree is congruent with the analyses based on SNP calling.
V) PCA
1) Run PCangsd
PCangsd was used to get the PCA as well as to estimate the admixture proportion (option -admix). It was ran with different value of K ranging from 1 to 8. It was ran using the following command:
for i in {1..5}; do sbatch ./PCangsd.sh $i; done
#! /bin/bash
#SBATCH --job-name=PCAngsd
#SBATCH --partition=short
#SBATCH --mem=2GB
#SBATCH --cpus-per-task=1
#SBATCH --time=00:25:00
#Script name: PCangsd.sh
################
# Define paths #
################
PATH_GFL="~/Results"
OUTPUT_PATH="~/Results/PCangsd"
OUTPUT="PCAangsd_results_K$1"
PATH_SCRIPT="~/Script/PCangsd"
module load PCAngsd/0.98-foss-2018a-Python-3.6.4
cd $PATH_GFL
pcangsd.py -beagle ${PATH_GFL}/Wasp_Pina_subset.glf.beagle.gz \
-admix \ # estimate admixture proportion
-admix_K $1 \ # Estimate admixture proportion with K = $1 groups.
-admix_iter 500 \ # Perform 500 EM iterations for admixture estimations.
-admix_alpha 50 \
-o ${OUTPUT_PATH}/${OUTPUT} \
-threads 1 \
-e 5 \ # Use 5 eigenvalues to model individual allele frequencies
-admix_tole 1e-7 \ # Tolerance value for update in admixture estimations
-minMaf 0.10 # Make sure the maf it at 0.
2) PC1 and PC2
The raw PCA was plotted using this R commands
setwd("/Users/yacinebenchehida/Desktop/Metadata/Results/PCangsd/maf_0.05/")
library(RcppCNPy)
library(ggplot2)
library(RColorBrewer)
library(grid)
library(gridExtra)
library(ggplot2)
library(reshape2)
library(ggtree)
library(cowplot)
library(tidyverse)
library(patchwork)
library(plyr)
library(ggforce)
D <- npyLoad("PCAangsd_results_K5.cov.npy")
e <- eigen(D)
pop<-read.table("../../../bam_location_correspondance.txt")
colnames(pop)= c("ID","Location")
cumulative_variance = sum(unlist(e$values)) # Estimate the total variance of the data set
eigenvalues = as.data.frame(cbind(seq(1:length(e$values)),e$values,(e$values*100/(cumulative_variance))))
colnames(eigenvalues) = c("PC","Eigenvalue","Contributions")
population=unique(pop$Location)
myColors <- brewer.pal(10,"Paired")
All_data = cbind(e$vectors,pop)
data = cbind(All_data[,c(1,2)],pop)
colnames(data) = c("PC1","PC2","ID","Location")
data$Location <- factor(data$Location, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai"))
data$reproduction = "ToBeAdded"
data[data$Location=="Iriomote" | data$Location=="Okinawa" | data$Location=="Amami",5] = c("Sexual")
data[data$reproduction!="Sexual",5] = "Asexual"
repro = data[,5]
ggplot(data, aes(x=-PC1, y=PC2,color=Location,shape=reproduction)) + geom_point(size=2) + theme_bw() +
xlab(paste("PC1 (",round(eigenvalues[1,3],digits=2)," %)",sep="")) +
ylab(paste("PC2 (",round(eigenvalues[2,3],digits=2)," %)",sep="")) +
scale_color_manual(values = myColors) + stat_ellipse(level = 0.95, size = 1)

=> The tree is consistent with the analyses based on SNP calling.
##################################
# Plot eigenvalues contributions #
##################################
ggplot(data=eigenvalues[1:10,], aes(x=PC, y=Contributions)) +
geom_bar(stat="identity", fill="steelblue") +
theme_bw() +
scale_x_continuous(breaks = unique(sort(eigenvalues$PC))) +
ylab("Contribution (%)") +
xlab("Principal Component")

scree<-eigenvalues[1:10,c(1,3)] %>%
ggplot(aes(x=PC,y=Contributions))+theme_bw()+
geom_col()+
labs(title="Scree plot",y="",x="")+theme(
title=element_text(size=8))
The contribution in % of each PC to the total variance of the PC is plot in the barplot above. 5 first PCs stand out and explain 61.8% of the total variance.
Below is the same scatter plot as before but trying to make it look more like the one of the manuscript:

3) Plot additional PCs (PC1 to PC5)
plots = list()
counter = 1
for (i in 1:5){
for (j in 1:5){
if(j>i){
if(j<5 || i<4){
data = cbind(All_data[,c(i,j)],pop,repro)
colnames(data) = c("PC1","PC2","ID","Location","Reproduction")
data$Location <- factor(data$Location, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai"))
plot = ggplot(data, aes(x=-PC1, y=PC2,color=Location,shape=Reproduction)) + geom_point(size=2) + theme_bw() +
xlab(paste("PC",i," (",round(eigenvalues[i,3],digits=2)," %)",sep="")) +
ylab(paste("PC",j," (",round(eigenvalues[j,3],digits=2)," %)",sep="")) +
scale_color_manual(values = myColors) + theme(legend.position="none") #+ stat_ellipse(level = 0.95, size = 1)
plots[[counter]] = plot
counter = counter + 1
}
else{
data = cbind(All_data[,c(i,j)],pop,repro)
colnames(data) = c("PC1","PC2","ID","Location","Reproduction")
data$Location <- factor(data$Location, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai","Missing"))
plot = ggplot(data, aes(x=-PC1, y=PC2,color=Location,shape=Reproduction)) + geom_point(size=2) + theme_bw() +
xlab(paste("PC",i," (",round(eigenvalues[i,3],digits=2)," %)",sep="")) +
ylab(paste("PC",j," (",round(eigenvalues[j,3],digits=2)," %)",sep="")) +
scale_color_manual(values = myColors) +
theme(legend.key.size = unit(0.5, 'cm')) #+ stat_ellipse(level = 0.95, size = 1)
plots[[counter]] = plot
counter = counter + 1
}
}
}
}
grid.arrange(grobs=plots, layout_matrix=rbind(c(1, 1, 1, 2, 2, 2, 3, 3, 3,4, 4, 4),
c( 5, 5, 5, 6, 6, 6,7, 7, 7, 8, 8 ,8),c(9,9,9, 10,10,10,10,NA,NA,NA,NA,NA)))

PC1: Separate asexual, Iriomote and the 2 other sexual.
PC2: Separate asexual, Iriomote and the 2 other sexual (so same as PC1)
PC3: Separate the asexual one into 2 groups with 3 individuals from Kobe being an hybrid between the two groups.
PC4: Separate Okinawa from Amami (the two sexual one excluding Iriomote)
PC5: No more group could be delineate after PC4. So no really new cluster is observed in PC5.
4) Plot admixture proportion estimated by pCangsd for K=1 to K=8
setwd("/Users/yacinebenchehida/Desktop/Metadata/Results/PCangsd/maf_0.05/")
New_col = c("darkcyan","#A52A2A", "#8A2BE2", "darkolivegreen4","darkorchid4","black","orange","grey")
par(mfrow=c(4,2), mai = c(0.3, 0.3, 0.3, 0.4))
for (i in 1:8){
D <- npyLoad(paste("PCAangsd_results_K",i,".cov.npy",sep=""))
e <- eigen(D)
pop<-read.table("../../../bam_location_correspondance.txt")
colnames(pop)= c("ID","Location")
cumulative_variance = sum(unlist(e$values)) # Estimate the total variance of the data set
eigenvalues = as.data.frame(cbind(seq(1:length(e$values)),e$values,(e$values*100/(cumulative_variance))))
colnames(eigenvalues) = c("PC","Eigenvalue","Contributions")
population=unique(pop$Location)
myColors <- brewer.pal(10,"Paired")
All_data = cbind(e$vectors,pop)
data = cbind(All_data[,c(1,2)],pop)
colnames(data) = c("PC1","PC2","ID","Location")
D2 <- npyLoad(paste("PCAangsd_results_K",i,".admix.Q.npy",sep=""))
data = cbind(data,D2)
data$Location <- factor(data$Location, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai"))
data = with(data, data[order(Location),])
to_plot = t(data[,5:(4+i)])
colnames(to_plot) = NULL
s = replace(rep(0, 95), !duplicated(as.character(data$Location)), 0.5)
g <- table(data$Location)
barplot(to_plot,col=New_col,main=paste("K",i,sep=""), border = NA, space = s)
text(g / 2 + c(0, cumsum(g[-length(g)])), -0.15, names(g), srt = 25, xpd = NA)
}

The results echo to PCA. The admixture proportion obtain at K=5 are very similar to the admixture proportion obtained with the software admixture. Here is a zoom at K=5.
setwd("/Users/yacinebenchehida/Desktop/Metadata/Results/PCangsd/maf_0.05/")
for (i in 5){
D <- npyLoad(paste("PCAangsd_results_K",i,".cov.npy",sep=""))
e <- eigen(D)
pop<-read.table("../../../bam_location_correspondance.txt")
colnames(pop)= c("ID","Location")
cumulative_variance = sum(unlist(e$values)) # Estimate the total variance of the data set
eigenvalues = as.data.frame(cbind(seq(1:length(e$values)),e$values,(e$values*100/(cumulative_variance))))
colnames(eigenvalues) = c("PC","Eigenvalue","Contributions")
population=unique(pop$Location)
myColors <- brewer.pal(10,"Paired")
All_data = cbind(e$vectors,pop)
data = cbind(All_data[,c(1,2)],pop)
colnames(data) = c("PC1","PC2","ID","Location")
D2 <- npyLoad(paste("PCAangsd_results_K",i,".admix.Q.npy",sep=""))
data = cbind(data,D2)
data$Location <- factor(data$Location, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai"))
data = with(data, data[order(Location),])
to_plot = t(data[,5:(4+i)])
colnames(to_plot) = NULL
s = replace(rep(0, 95), !duplicated(as.character(data$Location)), 0.5)
g <- table(data$Location)
barplot(to_plot,col=New_col[1:5],main=paste("K",i,sep=""), border = NA, space = s)
text(g / 2 + c(0, cumsum(g[-length(g)])), -0.15, names(g), srt = 25, xpd = NA)
}

VII) FST
1) Compute GL for each population and Site frequency spectrum for each population
The calculation of the FST with angsd rely on the SFS. The SFS must be estimated separately for each population of interest. The FST were computed between:
1) 5 (Ks) population inferred with the clustering approaches (NGSadmix, Admixture, PCangsd).
2) The 10 geographical locations.
The SFS for each population were estimated using the following command:
#! /bin/bash
#SBATCH --job-name=GL
#SBATCH --partition=regular
#SBATCH --mem=10GB
#SBATCH --cpus-per-task=1
#SBATCH --time=0-8:00:00
################
# Define paths #
################
module load HTSlib/1.12-GCC-10.2.0
BAM1_LIST="~/Data/BAM_FILES/BAM_$1.txt"
PATH_RESULTS="~/FST/GFL/$1"
GFL_File1="$1_intersect.glf"
REF="~/Data/ref_genome/ajConsensus.fasta"
ANGSD="/data/p274105/tool_genomics/angsd"
mkdir -p $PATH_RESULTS
#######################################################
# COMPUTE MIN SAMPLE SIZE AND MIN/MAX DEPTH for POP 1 #
#######################################################
NB_SAMPLES=$(cat $BAM1_LIST|wc -l)
MINDEPTH=$((NB_SAMPLES*2)) # Min depth 2X
MAXDEPTH=$((NB_SAMPLES*30)) # Max depth 30 X
MIN_sample=$((($NB_SAMPLES * 80 ) / 100)) # Min number of individuals 80%
echo $MINDEPTH
echo $MAXDEPTH
echo $MIN_sample
#########################################
# COMPUTE GENOTYPE LIKELIHOODS FOR POP1 #
#########################################
$ANGSD/angsd -GL 2 \
-ref ${REF} \
-anc ${REF} \
-bam $BAM1_LIST \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 \
-trim 0 -C 50 -baq 1 \
-out ${PATH_RESULTS}/${GFL_File1} \
-nThreads 1 \
-minMapQ 20 \
-minQ 20 \
-doSaf 1 \ # To estimate the site frequency spectrum
-minInd $MIN_sample \
-doCounts 1 \
-setMinDepth $MINDEPTH -setMaxDepth $MAXDEPTH
Because we are not using an outgroup to get the ancestral state, the SFS must be folded. HOWEVER, the folding of the SFS should be performed in a later step This is why:

Also because the analyses is based on the SFS, monomorphic and rare variants must be kept to infer an unbiaised SFS. Therefore the flags on the minor allele frequency and on SNP P-value are not used here.

Applying these filters I retained between 16,231,719 and 202,084,034 sites (not SNPs) depending on the population.
2) Estimate 2DSFS + compute FST based on as many sites as possible
1: Compute the folded 2D SFS each pair of population.
2: Index the SFS.
3: Get the global estimate of the FST for each pair of population.
4: Compute the weighted FST over 10kb windows.
#! /bin/bash
#SBATCH --job-name=SFS_FST
#SBATCH --partition=gelifes
#SBATCH --mem=35GB
#SBATCH --cpus-per-task=8
#SBATCH --time=0-20:00:00
# cript name: SFS_FST.sh
################
# Define paths #
################
module load HTSlib/1.12-GCC-10.2.0
BAM1_LIST="~/Data/BAM_FILES/BAM_$1.txt"
BAM2_LIST="~/Data/BAM_FILES/BAM_$2.txt"
PATH_RESULTS="~/Results/FST/GFL/$1_$2"
GFL_File1="Wasp_Pina_$1.glf"
GFL_File2="Wasp_Pina_$2.glf"
REF="~/Data/ref_genome/ajConsensus.fasta"
ANGSD="/data/p274105/tool_genomics/angsd"
mkdir -p $PATH_RESULTS
##########
# ML SFS #
##########
#calculate the 2dsfs prior
$ANGSD/misc/realSFS $PATH_RESULTS/*"$1".glf.saf.idx $PATH_RESULTS/*"$2".glf.saf.idx -fold 1 -P 8 > $PATH_RESULTS/"$1"_"$2".ml
echo "2DSFS READY"
#prepare the fst for easy window analysis
$ANGSD/misc/realSFS fst index $PATH_RESULTS/*"$1".glf.saf.idx $PATH_RESULTS/*"$2".glf.saf.idx -sfs $PATH_RESULTS/"$1"_"$2".ml -fold 1 -whichFst 1 -fstout $PATH_RESULTS/"$1"_"$2" -P 8
echo "DATA READY FOR SFS CALCULATION"
#get the global estimate
echo -e Unweighted"\t"Weighted > $PATH_RESULTS/"$1"_"$2"_results.fst
$ANGSD/misc/realSFS fst stats $PATH_RESULTS/"$1"_"$2".fst.idx -fold 1 -whichFst 1 >> $PATH_RESULTS/"$1"_"$2"_results.fst
echo "GLOBAL FST COMPUTED"
# FST in sliding windows of 50 kb
$ANGSD/misc/realSFS fst stats2 $PATH_RESULTS/"$1"_"$2".fst.idx -win 10000 -step 10000 -fold 1 -whichFst 1 > $PATH_RESULTS/Sliding_windows_FST_"$1"_"$2".txt
echo "SLIDING WINDOW FST COMPUTED"
rm $PATH_RESULTS/*gz
# GL for everything. Including monomorphic sites. GL estimate for each pop separately. Then used to estimate SFS within pop. Then used to estimate the 2D SFs. Finally 2Dsfs is used to estimate the FST. FST were computed globally + within windows of 50kB.
The -whichFST option was used because it performs better with low sample sizes.
To avoid double comparisons (e.g. K1 with K2 and K2 with K1) this script was run as follow:
“for i in K{1..5}; do for j in K{1..5}; do if [ "\(i" \< "\)j" ]; then sbatch ./SFS_FST $i $j; fi; done; done”
3) Average FST values + 95% confidence interval (95CI)
The following R command was used to estimate the 95CI. It used the R package Rmisc which assume that the data are t-distributed (number of window-1 degree of freedom)
suppressPackageStartupMessages(library(Rmisc))
library(Rmisc)
args = commandArgs(trailingOnly=TRUE)
fst = read.table(args)
CI(unlist(fst), ci=0.95)
The previous R command was ran inside the following bash script:
module load R
cd GFL
(echo -e Pop1"\t"Pop2"\t"Global_FST"\t"Mean_FST"\t"Low_CI_FST"\t"High_CI_FST
for i in $(ls| grep -P "K\d_K\d")
do
POP=$(echo $i| perl -pe 's/_/\t/g')
cd $i
GLOBAL_FST=$(cat *fst|awk '{print $2}'|awk 'NR>1 {print $0}')
cat Sliding*|awk '{print $5}' > bizbizfaitlamouche.txt
MEAN_FST=$(Rscript --vanilla ../../FST_CI.R bizbizfaitlamouche.txt|awk 'NR>1 {print $0}'| cut -f 2 -d " ")
LOW_CI=$(Rscript --vanilla ../../FST_CI.R bizbizfaitlamouche.txt|awk 'NR>1 {print $0}'| cut -f 3 -d " ")
HIGH_CI=$(Rscript --vanilla ../../FST_CI.R bizbizfaitlamouche.txt|awk 'NR>1 {print $0}'| cut -f 1 -d " ")
echo -e $POP"\t"$GLOBAL_FST"\t"$MEAN_FST"\t"$LOW_CI"\t"$HIGH_CI
rm bizbizfaitlamouche.txt
cd ..
done) > ../FST_between_NGSadmix_GROUP.txt
(echo -e Pop1"\t"Pop2"\t"Global_FST"\t"Mean_FST"\t"Low_CI_FST"\t"High_CI_FST
for i in $(ls| grep -v -P "K\d|maf")
do
POP=$(echo $i| perl -pe 's/_/\t/g')
cd $i
GLOBAL_FST=$(cat *fst|awk '{print $2}'|awk 'NR>1 {print $0}')
cat Sliding*|awk '{print $5}' > bizbizfaitlamouche.txt
MEAN_FST=$(Rscript --vanilla ../../FST_CI.R bizbizfaitlamouche.txt|awk 'NR>1 {print $0}'| cut -f 2 -d " ")
LOW_CI=$(Rscript --vanilla ../../FST_CI.R bizbizfaitlamouche.txt|awk 'NR>1 {print $0}'| cut -f 3 -d " ")
HIGH_CI=$(Rscript --vanilla ../../FST_CI.R bizbizfaitlamouche.txt|awk 'NR>1 {print $0}'| cut -f 1 -d " ")
echo -e $POP"\t"$GLOBAL_FST"\t"$MEAN_FST"\t"$LOW_CI"\t"$HIGH_CI
rm bizbizfaitlamouche.txt
cd ..
done) > ../FST_between_GEO_GROUP.txt
a) FST based on NGSadmix groups
b) FST based on Geographic groups
4) Visualization with heatmaps (based on the Mean_FST)
a) FST based on NGSadmix groups
library(viridis)
library(ggtext)
FST = read.table("FST_between_NGSadmix_GROUP.txt",header =T)
FST = FST[,c(1,2,4,5,6)]
FST$To_plot = paste0('**', round(FST$Mean_FST,3),'**<br />[',round(FST$Low_CI_FST,3),'-',round(FST$High_CI_FST,3),']' , sep = "")
p = ggplot(data =FST, aes(x=Pop1, y=Pop2, fill=Mean_FST)) +
geom_tile(colour = "black") + scale_fill_gradientn(colours = viridis(10000)) + theme(plot.title = element_text(hjust = 0.5)) + labs(fill = "FST") + xlab("Population 1") + ylab("Population 2") + theme_bw() + geom_textbox(label = FST$To_plot, box.size = 0, fill = NA, halign = 0.5, colour = ifelse(FST$Mean_FST < 0.20,'white','black'),size=4) + guides(colour = FALSE) + scale_color_manual(values=c('black','white')) +
theme(axis.text.x = element_text(colour = New_col[1:4], size = 11,face="bold")) +
theme(axis.text.y = element_text(colour = New_col[2:5], size = 11,face="bold")) +
theme(legend.title = element_text(size=16, face = "bold")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p

NA
NA
=> The results are overall consistent with SNP calling results. There are still three differences:
a: The FST values are lower.
b: The FST values between K4 and K5 (between asexual pop.) tend to be larger (relatively speaking compared to the rest)
c: The FST values between the sexual population Iriomote (K1) vs K2/K3 (Amami/Okinawa) show lower FSt values than comparisons between Iriomote and K4/5 (asexual). The difference was less pronounced in the analysis performed by Pina:

b) FST based on Geographic groups
library(reshape2)
library(dplyr)
FST = read.table("FST_between_GEO_GROUP.txt",header =T)
FST = FST[,c(1,2,4,5,6)]
FST$To_plot = paste0('**', round(FST$Mean_FST,3),'**<br />[',round(FST$Low_CI_FST,3),'-',round(FST$High_CI_FST,3),']' , sep = "")
for (row in 1:dim(FST)[1]){
if(FST[row,c(1)]!=FST[row,c(2)]){
FST = rbind(FST,FST[row,])
tmp_pop1 = FST[nrow(FST),c(1)]
tmp_pop2 = FST[nrow(FST),c(2)]
FST[nrow(FST),c(2)] = tmp_pop1
FST[nrow(FST),c(1)] = tmp_pop2
}
}
FST$Pop1 <- factor(FST$Pop1, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Sendai","Tokyo"))
FST$Pop2 <- factor(FST$Pop2, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Sendai","Tokyo"))
FST = arrange(FST, Pop1, Pop2)
list = list()
a = 1
for (i in 1:dim(FST)[1]){
for (j in 1:dim(FST)[1]){
if(j > i){
if(FST[i,1]==FST[j,2] && FST[i,2]==FST[j,1]){
list[[a]] = FST[i,]
a = a + 1
}
}
}
}
FST = do.call(rbind,list)
p = ggplot(data =FST, aes(x=Pop1, y=Pop2, fill=Mean_FST)) +
geom_tile(colour = "black") + scale_fill_gradientn(colours = viridis(10000)) + theme(plot.title = element_text(hjust = 0.5)) + labs(fill ="FST") + xlab("Population 1") + ylab("Population 2") + theme_bw() + geom_textbox(label = FST$To_plot, box.size = 0, fill = NA, halign = 0.5, colour = ifelse(FST$Mean_FST < 0.20,'white','black'),size=5) + guides(colour = FALSE) + scale_color_manual(values=c('black','white')) +
theme(axis.text.x = element_text(colour = myColors[1:9], size = 16,face="bold")) +
theme(axis.text.y = element_text(colour = myColors[2:10], size = 16,face="bold")) +
theme(legend.title = element_text(size=16, face = "bold")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
p

=> Consistent with all the analyses.
IX) Kinship (for subset of 54 individuals)
PCangsd was used to estimate the relatedness between each pair of individuals while taking into account the genetic structure (option -kinship). The method implemented in ANGSD is based on the PCrelate approach (https://www.cell.com/ajhg/fulltext/S0002-9297(15)00493-0).
For the sake of simplicity, I first ran PCrelate for the 95 available individuals. Then afterward I sub-sampled the resulting relatedness matrix to have the relatedness coefficient just for the 54 individuals for which microbiome data are available.
PCrelate was ran on the LD pruned data set using the below script:
#! /bin/bash
#SBATCH --job-name=PCrel
#SBATCH --partition=short
#SBATCH --mem=1GB
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=h.y.ben.chehida@rug.nl
#SBATCH --time=00:30:00
# PCAngsd needs and accepts are genotype likelihoods in Beagle format.
# We generated it already for the NGSadmix analysis. So we can reuse it here.
################
# Define paths #
################
GFL="/data/p274105/Wasp_Pina/Results/Wasp_Pina_subset.glf.beagle.gz"
OUTPUT="/data/p274105/Wasp_Pina/Results/PCrelate"
module load PCAngsd/0.98-foss-2018a-Python-3.6.4
module load R/4.0.0-foss-2020a
#################################
# Estimate pairwise relatedness #
#################################
pcangsd.py \
-beagle $GFL \
-o $OUTPUT/inbreeding_1/Res_1 \
-threads 1 \
-inbreed 1 \
-inbreed_iter 1000 \
-kinship \
-inbreed_tole 1e-6
The R code below:
1: Extract the pairwise relatedness coefficients for the 54 individuals of interests.
2: Set the negative relatedness values to 0.
3: Visually summarize the results using a default heat map (I can modify it so it looks like the FST one if needed).
#!/usr/bin/env Rscript
library(RcppCNPy) # library to read numpy files
options(scipen=999) # use full number instead of scientific notation
library(pheatmap)
library(dplyr)
kinship <- npyLoad(filename="PCrelate/inbreeding_1/Res_1.kinship.npy") #Import kinship matrix in numpy format
kinship = as.data.frame(kinship) # transform the matrix to a data frame
individuals <- read.table("../bam_location_correspondance.txt") # individuals name and pop of each individual
colnames(kinship) =unlist(individuals[,c(1)]) # add individuals name to each column
rownames(kinship) =unlist(individuals[,c(1)]) # add inviduals name to each row
kinship = kinship[rownames(kinship) %in% unlist(ind_to_subset),colnames(kinship )%in% unlist(ind_to_subset)]
kinship[kinship < 0] = 0 # kinship values inferior to 0 are not biologically meaningful. They are set to 0.
#kinship[kinship > 0.25] = 1
individuals = individuals[individuals$V1 %in% unlist(ind_to_subset),]
geography = as.data.frame(individuals[,2],stringsAsFactors=TRUE) # store information on geography in a separate object
rownames(geography) = individuals[,1]
colnames(geography) = "Geography"
geography$Geography <- factor(geography$Geography, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai"))
geography$Geography = sort(geography$Geography)
# Block to assign desired colors to the desired populations.
mycolors = cbind(unique(geography),myColors)
geo = geography
geography$colors = "empty"
for (i in 1:dim(geography)[1]){
for (j in 1:dim(mycolors)[1]){
if(geography[i,1]==mycolors[j,1])
geography[i,2] = mycolors[j,2]
}
}
# Prepare the colors in the format desired by pheatmap
mat_colors <- list(Geography = myColors)
names(mat_colors$Geography) <- unique(unlist(geo))
# Pheatmap plot
pheatmap(t(kinship),na_col="white",annotation_col=geo,annotation_row=geo,cluster_rows=F,cluster_cols=F,annotation_colors = mat_colors)

max(unlist(kinship[kinship < 0.25]))
[1] 0.0407
The maximal relatedness values is equal to 0.0407
If we look at the distribution of the estimated relatedness values
plot(density(unlist(kinship[kinship < 0.25])))

Basically none of the individuals are related.
This result strongly contrasts with the results obtained using SNPrelate:

With SNPrelate many of the values are very high.
The following R code:
1: Label individuals according to their geographic ID (not the one in the bam files starting with Sx)
2: Create and display the matrix values that I will share with Pina (for correlation with the microbiome)
# Reverse distance matrix names to use the real geographical ID
for (i in 1:dim(test1)[1]){
for (j in 1:dim(ind_to_subset)[1]){
if (colnames(kinship)[j]==test1[i,2]){
colnames(kinship)[j]=test1[i,1]
rownames(kinship)[j]=test1[i,1]
}
}
}
# order matrix geographically
order_matrix = read.table("ind_for_matrix.txt")
kinship = kinship[unlist(order_matrix),unlist(order_matrix)]
kinship = round(kinship,digits=4)
# Create a text file with the relatedness matrix
write.table(kinship, file="matrix_relatedness_54_ind.txt",sep="\t",quote=FALSE,row.names = TRUE,
col.names = TRUE)
# Display matrix
kinship
Write final matrix of relatedness (handle missing tab in the beginning)
cat matrix_relatedness_54_ind.txt | head -n 1 | perl -pe 's/I_4/\tI_4/g' > matrix_relatedness.txt
cat matrix_relatedness_54_ind.txt| awk 'NR>1' >> matrix_relatedness.txt
rm matrix_relatedness_54_ind.txt
mv matrix_relatedness.txt matrix_relatedness_54_ind.txt
---
title: "Genotype likelihood based analyses. Pina et al. 2022."
subtitle: "Population structure and differentiation of the wasp *Asobara japonica*"
author: "Yacine Ben Chehida"
output: 
  html_notebook: 
    toc: yes
editor_options: 
  chunk_output_type: inline
---

```{r setup, include=FALSE}
remotes::install_github('yihui/knitr')
library(knitr)
opts_chunk$set(echo = TRUE)
```


# I) Filter the data based on the genotype likelihood (GL). 
Based on our discussion of last week, we decided to apply the following filters:

+ **1:**  The maf = 0.05

+ **2:** Call a SNP is it's p-value is below ${10^{-6}}$

+ **3:** Discard reads with mapping quality below 15 (minMapQ)

+ **4:** Discard bases with base quality below (minQ)

+ **5:** Keep SNPs present in at least 70% of the individuals i.e. 95 * 0.7 $\approx$ 67 individuals

+ **6:** Sites with coverage ranging from 2 to 30 => -setMinDepth 190  -setMaxDepth 2850

This script was applied to each scaffold as a separate job using the following command:

"for i in $SCAFFOLD_NAME; do sbatch ./GL_per_scaffolds.sh $i;  done"

The scaffold names can be retrieved from the fasta reference genome file (ajConsensus.fasta) using the following command:
"SCAFFOLD_NAME=$(cat ajConsensus.fasta | grep ">"| awk '{print $1}'| sed 's/>//g')"
```{bash}
#! /bin/bash

#SBATCH --job-name=GL_EcoHP
#SBATCH --partition=short
#SBATCH --mem=8GB
#SBATCH --cpus-per-task=1
#SBATCH --time=0-00:30:00

# Script name: GL_per_scaffolds.sh

################
# Define paths #
################
BAM_LIST="~/list_bam.txt" # List of path to the 95 bam files
PATH_RESULTS="~/GFL/All/$1"
GFL_File="Wasp_Pina_LD_Prunning_$1.glf"
REF="~/ajConsensus.fasta"
ANGSD="/data/p274105/tool_genomics/angsd"

mkdir -p $PATH_RESULTS

$ANGSD/angsd -GL 2 \
-doGlf 2 \
-doMaf 1  \
-doMajorMinor 1 \
-doPost 1 -doGeno 8 \
-ref ${REF} \
-r $1 \
-bam ${BAM_LIST} \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 \
-trim 0 -C 50 -baq 1 \
-out ${mkdir -p $PATH_RESULTS}/${GFL_File} \
-nThreads 1 \
-minMapQ 15 \
-minQ 15 \
-SNP_pval 1e-6 \
-skipTriallelic 1 \
-minMaf 0.1 \
-minInd 67 \
-setMinDepth 190  -setMaxDepth 2850 -doCounts 1

```
<br/>
=> This script generates 2 important type of files: "glf.geno.gz" and "glf.mafs.gz" for each scaffold.

Then I gathered the results of each scaffold in a single glf.mafs and glf.geno files using the commands below:
```{bash}
```


```{bash}
PATH_RESULTS="/data/p274105/Wasp_Pina/Data/GFL/All/

cd $PATH_RESULTS

(for i in 000000F; do cd $i; zcat *glf.mafs.gz | head -n1  ; cd ..; done ; \
for i in $(ls| grep -v -E "All|txt"); do cd $i; zcat *glf.mafs.gz | awk 'NR>1'  ; cd ..; done)> All_Chr.glf.mafs

(for i in $(ls| grep -v -E "All|txt"); do cd $i; zcat *glf.geno.gz  ; cd ..; done) > All_Chr.glf.geno
```


After filtering **130571** SNPs were kept.
<br/><br/>

# II) Linkage desequilibrium pruning
LD pruning was applied using the software NGSLD (https://github.com/fgvieira/ngsLD).
First let's check how LD decays with genetic distance.

The script below:

+ **1)**  Estimate LD between pair of SNP up to a distance of 50 Kb

+ **2)** Plot LD decay 
 
```{bash}
#! /bin/bash

#SBATCH --job-name=LD
#SBATCH --partition=regular
#SBATCH --mem=15GB
#SBATCH --cpus-per-task=1
#SBATCH --time=0-20:00:00

 #Script name: NgsLD_per_scaffolds.sh
 
################
# Define paths #
################
PATH_RESULTS="/data/p274105/Wasp_Pina/Data/GFL/All/"
NGSLD="/data/p274105/tool_genomics/ngsTools/ngsLD/"

##################
# Load libraries #
##################
module load R/4.0.0-foss-2020a 
module load Perl/5.26.1-foss-2018a
#cpanm --local-lib=~/perl5 local::lib
#eval $(perl -I ~/perl5/lib/perl5/ -Mlocal::lib)
#cpanm Graph::Easy

######################################
# Count the number of available SNPs #
#######################################
cd $PATH_RESULTS
cat  All_Chr.glf.mafs | cut -f 1,2 | tail -n +2 > Test_pos.txt
NS=`cat Test_pos* | wc -l` 

######################################################
# Estimate LD between every pair of SNPs up to 50 kb #
######################################################
$NGSLD/ngsLD --geno All_Chr.glf.geno --n_ind 95 --max_kb_dist 50 --n_sites $NS --out NGSLDOUTPUT.txt --pos Test_pos* --verbose 1

#########################################
# Plot the graph of LD decay over 50 kb #
#########################################
Rscript --vanilla --slave $NGSLD/scripts/fit_LDdecay.R --ld_file file_name.txt --out plot_LD.pdf --fit_level 2 --ld r2 --n_ind 95 --plot_y_lim 0.5 --plot_x_lim 50
```
<br/>
The results show the LD (measured as $r^2$) goes below 0.1 after around **20 Kb**. 
<br/>
```{r echo=FALSE,  include=TRUE, out.width="75%"}
knitr::include_graphics(path="Screenshot 2022-07-19 at 09.10.39.png")
```

Therefore, subsequently LD was estimated between every pair of SNPs up to a distance of  **20 Kb** using the following command:
```{bash}
cd $PATH_RESULTS

module load Perl/5.26.1-foss-2018a

perl $NGSLD/scripts/prune_graph.pl --in_file NGSLDOUTPUT.txt --max_kb_dist 20 --min_weight 0.6 --out LD_unlinked.id
```
<br/>
**88485** SNPs remained after LD pruning. 
<br/><br/>

# III) GL on the LD pruned data
<br/>
It's basically the same as for **I)** just the analysis was performed on a subset of LD pruned SNPs. 
The subset of LD pruned SNPs are provided as argument to angsd using the option **-sites**.

```{bash}
#! /bin/bash

#SBATCH --job-name=GL_EcoHP
#SBATCH --partition=short
#SBATCH --mem=8GB
#SBATCH --cpus-per-task=1
#SBATCH --time=0-00:30:00

#Script name: GL_subset_per_scaffolds.sh

################
# Define paths #
################
BAM_LIST="~/list_bam.txt"
PATH_INPUT="~/GFL/All/"
PATH_OUTPUT="~/GFL/LD_prunned/$1"
GFL_File="Wasp_Pina_subset_$1.glf"
REF="~/ajConsensus.fasta"
ANGSD="/data/p274105/tool_genomics/angsd"

cd $PATH_INPUT
mkdir -p $PATH_OUTPUT

$ANGSD/angsd sites index ${PATH_RESULTS}/LD_unlinked_to_sort.id

$ANGSD/angsd -GL 2 \
-doGlf 2 \
-doMaf 1  \
-doMajorMinor 1 \
-doPost 1 -doGeno 8 \
-ref ${REF} \
-r $1 \
-bam ${BAM_LIST} \
-sites ${PATH_INPUT}/LD_unlinked_to_sort.id \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 \
-trim 0 -C 50 -baq 1 \
-out ${PATH_OUTPUT}/${GFL_File} \
-nThreads 1 \
-minMapQ 15 \
-minQ 15 \
-SNP_pval 1e-6 \
-skipTriallelic 1 \
-minMaf 0.10 \
-minInd 67 \
-setMinDepth 190  -setMaxDepth 2850 -doCounts 1
```

This script generated the *glf.beagle.gz file that is the base for most of the later analyses. 
<br/><br/>

# IV) Phylogenetic relationships
The phylogenetic relationship were inferred using: 

+ **1:** NgsDIST to calculate the distance matrix between the taxa and the bootstrap distances matrices.

+ **2:** FASTME to infer the resulting phylogenetic tree.

+ **3:** RAXML to replace the bootstrap support value on the main tree.

+ **4:** The tree was mid-point rooted using R package ape.

+ **5:** The tree was plotted using the R package ggtree.

```{bash}
#! /bin/bash

#SBATCH --job-name=ngsDistWasp
#SBATCH --partition=regular
#SBATCH --mem=6GB
#SBATCH --cpus-per-task=1
#SBATCH --time=2-22:30:00

# To calculate the distance matrix : 
# https://github.com/mfumagalli/ngsTools/blob/master/TUTORIAL.md
# https://github.com/fgvieira/ngsDist

################
# Define paths #
################
PATH_RESULTS="~/Results/NJtree"
PATH_INPUTS="~/Results"
cat /data/p274105/Wasp_Pina/Data/New_pop.txt|cut -f 1 > $PATH_INPUTS/Labels.txt
FASTME="/data/p274105/tool_genomics/fastme-2.1.5/binaries/fastme-2.1.5-linux64"
	
##############
# NGSDIST
##############
# Programs needed
NGSDIST="/data/p274105/tool_genomics/ngsTools/ngsDist"

cd $PATH_INPUTS

GFL="Wasp_Pina_subset.glf.beagle.gz"

NSITES=`zcat $GFL| awk '{print $1}'| awk 'NR>1'|wc -l`
echo $NSITES

# 1000 boot, block 100 kb for whole genome SNPs (LD prunned)
module load GSL/2.5-GCC-8.2.0-2.31.1
$NGSDIST/ngsDist \
--n_threads 1 \
--verbose 1 \
--geno $PATH_INPUTS/$GFL \
--probs \
--n_ind  95 \
--pairwise_del \
--n_sites ${NSITES} \
--n_boot_rep 1000 \
--boot_block_size 5000 \
--labels $PATH_INPUTS/Labels.txt \
--out $PATH_RESULTS/output.dist

#--avg_nuc_dist \
echo "NGSDIST RAN"

##############
# FASTME
##############
$FASTME -i $PATH_RESULTS/output.dist \
-D 1001 \
-s \
-o  $PATH_RESULTS/output.nwk

head -n 1 $PATH_RESULTS/output.nwk > $PATH_RESULTS/output.main.nwk
tail -n +2 $PATH_RESULTS/output.nwk | awk 'NF' > $PATH_RESULTS/output.boot.nwk

echo "FASTME ran"

#########
# RAXML #
#########
cd $PATH_RESULTS
module load RAxML/8.2.11-foss-2018a-mt-avx2
raxmlHPC -f b \
-t $PATH_RESULTS/output.main.nwk \
-z $PATH_RESULTS/output.boot.nwk \d
-m GTRCAT \
-T 1 \
-n output_final.tree

echo "RAXML ran"
```

The resulting  phylogenetic tree was mid rooted and  then plotted in R using the libraries ggtree and ape.

```{r fig.height=10, fig.width=14, message=FALSE, warning=FALSE}
library(ggplot2)
library(ggtree)
library(ape)
library(phytools)
library(RColorBrewer)

setwd("/Users/yacinebenchehida/Desktop/Metadata/Results/NJtree/")
tree <- ggtree::read.tree("RAxML_bipartitions.output_final.tree") # There are different read.tree() functions that executes different commands. To avoid any ambiguity here I add ape:: before to say that I want to use the read.tree() fonction from the ape package. This tree is a maximum likelihood tree of the 6 species of the porpoises family (Phoconidae) reconstructed with PhyML.
#tree = ape::root(tree,node=109, resolve.root=TRUE)

#######################################
# Transform the tree to a ggtree tree #
#######################################
tree2 <- tree
tree2 <- midpoint.root(tree)
tree2 <- ggtree(tree2)

########################################
# Define a color code for each species #
########################################
data_color = cbind(tree2$data$label[1:95], rep("test")) # Combine the column with individual names with a new column that will contain the color associated to each individual in an objected called data_color
colnames(data_color) =c("species","color")  # Give a name to the column of data_color
data_color = as.data.frame(data_color) # Transform data_color to a data frame
data_color[,2] = as.character(data_color[,2])  # Transform the color column into characters (insstead of factor)
pop = read.table("../../bam_location_correspondance.txt")
pop =  data.frame(inds=pop$V1, locality = pop$V2)
pop$color = "NAN"
data_color$locality = "NAN"

species = c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai")  # Create a vector with the unique species name
color = brewer.pal(10,"Paired")  # Create a vector with the color associated to each unique species name
info = cbind(species,color)  # Combine the unique species and the color in an object called info


# Loop to attribute to each individuals its color
for (i in 1:95){   
  for (j in 1:10) {   
    if(grepl(info[j,1],pop[i,2])){   
      pop[i,3] = info[j,2]
    }
  }
}

# Loop to attribute to each individuals its color
for (i in 1:95){   
  for (j in 1:95) {   
    if(pop[j,1]==data_color[i,1]){   
      data_color[i,2] = pop[j,3]
      data_color[i,3] = pop[j,2]
    }
  }
}


############################
# Define bootstraps groups #
############################
tree2$data$col <- cut(as.numeric(tree2$data$label),
                      breaks = c(0, 90, 100),
                      labels = c("<=90", ">90"))

for (i in 1:189){   
  for (j in 1:95) {   
    if(data_color[j,1]==tree2$data[i,4]){  
      tree2$data[i,4] = paste(tree2$data[i,4], data_color[j,3], sep = " ")
    }
  }
}

tree2$data$label = gsub("(\\w+) (\\w+)", "\\2",tree2$data$label, perl = TRUE)

#############################################
# Plot full tree to have the final topology #
#############################################
p <- ggtree(tree2$data) + geom_tiplab(size =4, color= data_color$color) + geom_point2(aes(subset=(!isTip), fill= tree2$data$col, size = tree2$data$col, shape=tree2$data$col))  + theme(legend.position="right")  + scale_fill_manual(breaks = c("<=90", ">90") , values = c("green","red")) + 
  scale_size_manual(breaks = c("<=90", ">90") , values = c(1.2,2.5)) + scale_shape_manual(breaks = c("<=90", ">90") , values = c(21,21)) + guides(size = FALSE, shape=FALSE, fill = guide_legend(override.aes=list(shape=21))) + labs(fill="Bootstraps") 
p = p + geom_treescale(x=0.30) + theme(legend.title = element_text(color = "black", size = 16),legend.text = element_text(color = "black", size = 14))  + theme(legend.background = element_rect(size=1, color = "black", linetype="solid")) 
p = p + ggplot2::xlim(0, 0.40) + theme(legend.position = c(0.1, 0.9))
p 
```
=> The tree is congruent with the analyses based on SNP calling.
<br/><br/>

# V) PCA
## 1) Run PCangsd
PCangsd was used to get the PCA as well as to estimate the admixture proportion (option -admix). 
It was ran with different value of K ranging from 1 to 8.
It was ran using the following command: 

for i in {1..5}; do sbatch ./PCangsd.sh $i; done

```{bash}
#! /bin/bash

#SBATCH --job-name=PCAngsd
#SBATCH --partition=short
#SBATCH --mem=2GB
#SBATCH --cpus-per-task=1
#SBATCH --time=00:25:00

#Script name: PCangsd.sh

################
# Define paths #
################

PATH_GFL="~/Results"
OUTPUT_PATH="~/Results/PCangsd"
OUTPUT="PCAangsd_results_K$1"
PATH_SCRIPT="~/Script/PCangsd"

module load PCAngsd/0.98-foss-2018a-Python-3.6.4

cd $PATH_GFL

pcangsd.py -beagle ${PATH_GFL}/Wasp_Pina_subset.glf.beagle.gz \
-admix \ # estimate admixture proportion
-admix_K $1 \ # Estimate admixture proportion with K = $1 groups.
-admix_iter 500 \ # Perform 500 EM iterations for admixture estimations.
-admix_alpha 50 \ 
-o ${OUTPUT_PATH}/${OUTPUT} \
-threads 1 \
-e 5 \ # Use 5 eigenvalues to model individual allele frequencies
-admix_tole 1e-7 \ # Tolerance value for update in admixture estimations
-minMaf 0.10 # Make sure the maf it at 0.
```
<br/><br/>

## 2) PC1 and PC2
The raw PCA was plotted using this R commands

```{r fig.height=5, fig.width=6, message=FALSE, warning=FALSE}
setwd("/Users/yacinebenchehida/Desktop/Metadata/Results/PCangsd/maf_0.05/")
library(RcppCNPy)
library(ggplot2)
library(RColorBrewer)
library(grid)
library(gridExtra)
library(ggplot2)
library(reshape2)
library(ggtree)
library(cowplot)
library(tidyverse)
library(patchwork) 
library(plyr)
library(ggforce)

D <- npyLoad("PCAangsd_results_K5.cov.npy")
e <- eigen(D)
pop<-read.table("../../../bam_location_correspondance.txt")
colnames(pop)= c("ID","Location")
cumulative_variance = sum(unlist(e$values)) # Estimate the total variance of the data set
eigenvalues = as.data.frame(cbind(seq(1:length(e$values)),e$values,(e$values*100/(cumulative_variance))))
colnames(eigenvalues) = c("PC","Eigenvalue","Contributions")
population=unique(pop$Location)
myColors <- brewer.pal(10,"Paired")
All_data = cbind(e$vectors,pop)
data = cbind(All_data[,c(1,2)],pop)
colnames(data) = c("PC1","PC2","ID","Location")

data$Location <- factor(data$Location, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai"))
data$reproduction = "ToBeAdded"
data[data$Location=="Iriomote" | data$Location=="Okinawa" | data$Location=="Amami",5] = c("Sexual")
data[data$reproduction!="Sexual",5] = "Asexual"
repro = data[,5]
ggplot(data, aes(x=-PC1, y=PC2,color=Location,shape=reproduction)) + geom_point(size=2) + theme_bw() + 
  xlab(paste("PC1 (",round(eigenvalues[1,3],digits=2)," %)",sep="")) + 
  ylab(paste("PC2 (",round(eigenvalues[2,3],digits=2)," %)",sep="")) +
  scale_color_manual(values = myColors) + stat_ellipse(level = 0.95, size = 1)
```
=> The tree is consistent with the analyses based on SNP calling.


```{r}
##################################
# Plot eigenvalues contributions #
##################################
ggplot(data=eigenvalues[1:10,], aes(x=PC, y=Contributions)) +
  geom_bar(stat="identity", fill="steelblue") + 
  theme_bw() + 
  scale_x_continuous(breaks = unique(sort(eigenvalues$PC))) +
  ylab("Contribution (%)") +
  xlab("Principal Component")

scree<-eigenvalues[1:10,c(1,3)] %>%
  ggplot(aes(x=PC,y=Contributions))+theme_bw()+
  geom_col()+
  labs(title="Scree plot",y="",x="")+theme(
    title=element_text(size=8))
```

The contribution in % of each PC to the total variance of the PC is plot in the barplot above. 5 first PCs stand out and explain 61.8% of the total variance.

Below is the same scatter plot as before but trying to make it look more like the one of the manuscript:

```{r echo=FALSE, fig.height=5, fig.width=6, message=FALSE, warning=FALSE}
data = cbind(All_data[,c(1,2)],pop)
colnames(data) = c("PC1","PC2","ID","Location")
data$Location <- factor(data$Location, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai"))
data$reproduction = "ToBeAdded"
data[data$Location=="Iriomote" | data$Location=="Okinawa" | data$Location=="Amami",5] = c("Sexual")
data[data$reproduction!="Sexual",5] = "Asexual"
repro = data[,5]

paj <- ggplot(data, aes(x=-PC1, y=PC2,color=Location,shape=reproduction)) 
paj <- paj + geom_point(size=2)
paj <- paj + stat_ellipse(level = 0.95, size = 1)
paj <- paj + scale_color_manual(values = myColors) 
paj <- paj + geom_hline(yintercept = 0) 
paj <- paj + geom_vline(xintercept = 0) 
paj <- paj + theme_bw() +labs(x=paste("PC1 (",round(eigenvalues[1,3],digits=2)," %)",sep=""),y=paste("PC2 (",round(eigenvalues[2,3],digits=2)," %)",sep="")) +theme(legend.position = "none")
paj  + patchwork::inset_element(scree,0.05, 0.05, 0.35, 0.35)

```
<br/><br/>

## 3) Plot additional PCs (PC1 to PC5)

```{r fig.height=9, fig.width=14, message=FALSE, warning=FALSE}
plots = list()
counter = 1
for (i in 1:5){
  for (j in 1:5){
    if(j>i){
      if(j<5 || i<4){
  data = cbind(All_data[,c(i,j)],pop,repro)
  colnames(data) = c("PC1","PC2","ID","Location","Reproduction")
  data$Location <- factor(data$Location, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai"))
  plot = ggplot(data, aes(x=-PC1, y=PC2,color=Location,shape=Reproduction)) + geom_point(size=2) + theme_bw() + 
    xlab(paste("PC",i," (",round(eigenvalues[i,3],digits=2)," %)",sep="")) + 
    ylab(paste("PC",j," (",round(eigenvalues[j,3],digits=2)," %)",sep="")) +
    scale_color_manual(values = myColors) + theme(legend.position="none") #+ stat_ellipse(level = 0.95, size = 1)
  plots[[counter]] = plot
  counter = counter + 1
      }
     else{
       data = cbind(All_data[,c(i,j)],pop,repro)
       colnames(data) = c("PC1","PC2","ID","Location","Reproduction")
       data$Location <- factor(data$Location, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai","Missing"))
       plot = ggplot(data, aes(x=-PC1, y=PC2,color=Location,shape=Reproduction)) + geom_point(size=2) + theme_bw() + 
         xlab(paste("PC",i," (",round(eigenvalues[i,3],digits=2)," %)",sep="")) + 
         ylab(paste("PC",j," (",round(eigenvalues[j,3],digits=2)," %)",sep="")) +
         scale_color_manual(values = myColors) +
         theme(legend.key.size = unit(0.5, 'cm')) #+ stat_ellipse(level = 0.95, size = 1)
       plots[[counter]] = plot
       counter = counter + 1
     }  
    }
  }
}

grid.arrange(grobs=plots, layout_matrix=rbind(c(1, 1, 1, 2, 2, 2, 3, 3, 3,4, 4, 4),
                                              c( 5, 5, 5, 6, 6, 6,7, 7, 7, 8, 8 ,8),c(9,9,9, 10,10,10,10,NA,NA,NA,NA,NA)))

```
<br/>

+ **PC1:** Separate asexual, Iriomote and the 2 other sexual.

+ **PC2:** Separate asexual, Iriomote and the 2 other sexual (so same as PC1)

+ **PC3:** Separate the asexual one into 2 groups with 3 individuals from Kobe being an hybrid between the two groups.

+ **PC4:** Separate Okinawa from Amami (the two sexual one excluding Iriomote)

+ **PC5:** No more group could be delineate after PC4. So no really new cluster is observed in PC5.

<br/><br/>

## 4) Plot admixture proportion estimated by pCangsd for K=1 to K=8 

```{r fig.height=7, fig.width=11, message=FALSE, warning=FALSE}

setwd("/Users/yacinebenchehida/Desktop/Metadata/Results/PCangsd/maf_0.05/")
New_col =  c("darkcyan","#A52A2A", "#8A2BE2",  "darkolivegreen4","darkorchid4","black","orange","grey")

par(mfrow=c(4,2), mai = c(0.3, 0.3, 0.3, 0.4))
for (i in 1:8){
  D <- npyLoad(paste("PCAangsd_results_K",i,".cov.npy",sep=""))
  e <- eigen(D)
  pop<-read.table("../../../bam_location_correspondance.txt")
  colnames(pop)= c("ID","Location")
  cumulative_variance = sum(unlist(e$values)) # Estimate the total variance of the data set
  eigenvalues = as.data.frame(cbind(seq(1:length(e$values)),e$values,(e$values*100/(cumulative_variance))))
  colnames(eigenvalues) = c("PC","Eigenvalue","Contributions")
  population=unique(pop$Location)
  myColors <- brewer.pal(10,"Paired")
  All_data = cbind(e$vectors,pop)
  data = cbind(All_data[,c(1,2)],pop)
  colnames(data) = c("PC1","PC2","ID","Location")
  D2 <- npyLoad(paste("PCAangsd_results_K",i,".admix.Q.npy",sep=""))
  data = cbind(data,D2)
  data$Location <- factor(data$Location, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai"))
  data = with(data, data[order(Location),])
  to_plot = t(data[,5:(4+i)])
  colnames(to_plot) = NULL
  s = replace(rep(0, 95), !duplicated(as.character(data$Location)), 0.5)
  g <- table(data$Location)
  barplot(to_plot,col=New_col,main=paste("K",i,sep=""), border = NA, space = s)
  text(g / 2 + c(0, cumsum(g[-length(g)])), -0.15, names(g), srt = 25, xpd = NA)
}
```
<br/>
The results echo to PCA. The admixture proportion obtain at K=5 are very similar to the admixture proportion obtained with the software admixture. Here is a zoom at K=5.

```{r fig.height=3, fig.width=9, message=FALSE, warning=FALSE}

setwd("/Users/yacinebenchehida/Desktop/Metadata/Results/PCangsd/maf_0.05/")

for (i in 5){
  D <- npyLoad(paste("PCAangsd_results_K",i,".cov.npy",sep=""))
  e <- eigen(D)
  pop<-read.table("../../../bam_location_correspondance.txt")
  colnames(pop)= c("ID","Location")
  cumulative_variance = sum(unlist(e$values)) # Estimate the total variance of the data set
  eigenvalues = as.data.frame(cbind(seq(1:length(e$values)),e$values,(e$values*100/(cumulative_variance))))
  colnames(eigenvalues) = c("PC","Eigenvalue","Contributions")
  population=unique(pop$Location)
  myColors <- brewer.pal(10,"Paired")
  All_data = cbind(e$vectors,pop)
  data = cbind(All_data[,c(1,2)],pop)
  colnames(data) = c("PC1","PC2","ID","Location")
  D2 <- npyLoad(paste("PCAangsd_results_K",i,".admix.Q.npy",sep=""))
  data = cbind(data,D2)
  data$Location <- factor(data$Location, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai"))
  data = with(data, data[order(Location),])
  to_plot = t(data[,5:(4+i)])
  colnames(to_plot) = NULL
  s = replace(rep(0, 95), !duplicated(as.character(data$Location)), 0.5)
  g <- table(data$Location)
  barplot(to_plot,col=New_col[1:5],main=paste("K",i,sep=""), border = NA, space = s)
  text(g / 2 + c(0, cumsum(g[-length(g)])), -0.15, names(g), srt = 25, xpd = NA)
}
```
<br/><br/>

# VI) NGSadmix
## 1) Run NGsadmix

NGSadmix was also used to estimate individual admixture proportion while taking into account incertainty in genotype.
Due to the instability of the results using the default setting, many of the default parameters were changed to perform a more stringent analyses. The explanations of all options is available here: http://www.popgen.dk/software/index.php/NgsAdmix.

The analyses was formed for K=1 to K=8 using 50 replicates for each K value. 
The script was ran using this command:

For K in {8..1}; do for replicate in {1..50}; do sbatch ./NgsAdmix_script.sh $K $replicate; done
```{bash}
#! /bin/bash

#SBATCH --job-name=NGSadm
#SBATCH --partition=short
#SBATCH --mem=7GB
#SBATCH --cpus-per-task=1
#SBATCH --time=0-00:30:00

#Script name: NgsAdmix_script.sh

################
# Define paths #
################
PATH_RESULTS="~/NGSadmix"
PATH_INPUT="~/Results"
module load angsd/0.925-foss-2018a

NGSadmix -likes $PATH_INPUT/Wasp_Pina_subset.glf.beagle.gz \
-K $1 \ # value of K to test (number of clusters)
-misTol 0.9 \ # Tolerance for considering a site as missing.
-tol 1e-8 \ # Tolerance for convergence. The smaller the more stringent.
-o ${PATH_RESULTS}/K"$1"_replicate_"$2" \ 
-P 1 \
-minInd 67 \
-maxiter 5000 \ # Maximum number of EM iterations. 
-minMaf 0.10 \
-tolLike50 0.3 # Loglikelihood difference in 50 iterations. The larger the more stringent.

```
<br/><br/>

## 2) Run clumpak
The raw results from NGSadmix (*opt files) were summarized using the software Clumpak. 
Clumpak estimates for each K value over the 50 replicates the different partition (in term of individual admixture proportion) inferred by NGSadmix.
```{bash}
#! /bin/bash

#SBATCH --job-name=clumpak
#SBATCH --partition=short
#SBATCH --mem=2GB
#SBATCH --cpus-per-task=1
#SBATCH --time=-00:30:00


#Script name: clumpak.sh

module load CLUMPAK/1.1-GCCcore-6.4.0-Perl-5.26.1

Path="~/Results/NGSadmix"
Data="~/Data"
cd $Path

zip myzip *opt

CLUMPAK.pl \
--id $((RANDOM%500)) \
--dir  $Path    \
--file  $Path/myzip.zip   \
--inputtype admixture \
--indtopop $Data/ind.txt \
--labels $Data/geo_group.txt
```

```{r echo=FALSE, include=TRUE, out.width="150%"}
knitr::include_graphics(path="Screenshot 2022-07-11 at 12.04.51.png")
```

Out of the 50 replicates I obtained:

+ **K=2** 50/50

+ **K=3** 50/50

+ **K=4** 50/50

+ **K=5** 32/50 (major partition) 18/50 (minor partition: not shown). 

=> The clustering obtained at K=5 (major partition) is the same as the one obtained at using Admixture and the PCA.
<br/><br/>

## 3) Best K?

Clumpak implements also two approaches to estimate the best K:

+ **The method of Evanno** 

+ **ln(Pr(X|K) or the Prichard K** 

They were calculated using the following script:
Because some replicates had exactly the same likelihood that creates a conflict with the way this method work. There I used a home made bash command (line 763) to jitter (add some noise to) the decimal value of the estimated likelihood.
```{bash}
module load CLUMPAK/1.1-GCCcore-6.4.0-Perl-5.26.1

Path="~/Results/NGSadmix"
cd $Path

cat *K*log |grep "best"|awk -F" " '{print $2}'|cut -c 6-24 > likelihood.txt
ls *K*log |cut -d _ -f 1 | sed 's/K//g' > kvalue.txt
paste kvalue.txt likelihood.txt > likelihood_bestK_input.txt
cat likelihood_bestK_input.txt | while read line; do echo $line|perl -pe 's/(\.)(\d+)$/$1"'$RANDOM'"/g'| sed 's/"//g'; done > new_likelihood_bestK_input.txt


perl /software/software/CLUMPAK/1.1-GCCcore-6.4.0-Perl-5.26.1/BestKByEvanno.pl --id 121 --d $Path --f $Path/new_likelihood_bestK_input.txt --inputtype lnprobbyk
```
<br/>
According to the Evanno method **K=4** is the best K value
<br/><br/>
According to the Pritchard K method **K=8** is the best K value
<br/><br/>
This results must be taken with a pinch of salt. Evanno Delta(K) has been shown to be quite unreliable. The pritchard K was designed to work well with the results of the software STRUCTURE. With other sotfware, my experience show that it always picks the larger number of K. There is clearly a data overfitting problem. 
<br/><br/>

# VII) FST
## 1) Compute GL for each population and Site frequency spectrum for each population
<br/>
The calculation of the FST  with angsd rely on the SFS. The SFS must be estimated separately for each population of interest.
The FST were computed between:

+ **1)** 5 (Ks) population inferred with the clustering approaches (NGSadmix, Admixture, PCangsd). 

+ **2)** The 10 geographical locations.

The SFS for each population were estimated using the following command:
```{bash}
#! /bin/bash

#SBATCH --job-name=GL
#SBATCH --partition=regular
#SBATCH --mem=10GB
#SBATCH --cpus-per-task=1
#SBATCH --time=0-8:00:00

################
# Define paths #
################
module load HTSlib/1.12-GCC-10.2.0

BAM1_LIST="~/Data/BAM_FILES/BAM_$1.txt"
PATH_RESULTS="~/FST/GFL/$1"
GFL_File1="$1_intersect.glf"
REF="~/Data/ref_genome/ajConsensus.fasta"
ANGSD="/data/p274105/tool_genomics/angsd"

mkdir -p  $PATH_RESULTS

#######################################################
# COMPUTE MIN SAMPLE SIZE AND MIN/MAX DEPTH for POP 1 #
#######################################################  
NB_SAMPLES=$(cat $BAM1_LIST|wc -l) 
MINDEPTH=$((NB_SAMPLES*2)) # Min depth 2X  
MAXDEPTH=$((NB_SAMPLES*30)) # Max depth 30 X
MIN_sample=$((($NB_SAMPLES * 80 ) / 100)) # Min number of individuals 80% 

echo $MINDEPTH
echo $MAXDEPTH
echo $MIN_sample

#########################################
# COMPUTE GENOTYPE LIKELIHOODS FOR POP1 #
#########################################
$ANGSD/angsd -GL 2 \
-ref ${REF} \
-anc ${REF} \
-bam $BAM1_LIST \
-uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 \
-trim 0 -C 50 -baq 1 \
-out ${PATH_RESULTS}/${GFL_File1} \
-nThreads 1 \
-minMapQ 20 \
-minQ 20 \
-doSaf 1 \ # To estimate the site frequency spectrum
-minInd $MIN_sample \
-doCounts 1 \
-setMinDepth $MINDEPTH -setMaxDepth $MAXDEPTH 
```

Because we are not using an outgroup to get the ancestral state, the SFS must be folded. HOWEVER, the folding of the SFS should be performed in a later step This is why:

```{r echo=FALSE, include=TRUE, out.width="150%"}
knitr::include_graphics(path="Screenshot 2022-07-10 at 20.04.34.png")
```

Also because the analyses is based on the SFS, monomorphic and rare variants must be kept to infer an unbiaised SFS. Therefore the flags on the minor allele frequency and on SNP P-value are not used here. 

```{r echo=FALSE, include=TRUE, out.width="150%"}
knitr::include_graphics(path="Screenshot 2022-07-10 at 20.13.10.png")
```


Applying these filters I retained between **16,231,719 and 202,084,034 sites** (not SNPs) depending on the population. 
<br/><br/>

## 2) Estimate 2DSFS + compute FST based on as many sites as possible
+ **1:** Compute the folded 2D SFS each pair of population.

+ **2:** Index the SFS.

+ **3:** Get the global estimate of the FST for each pair of population.

+ **4:** Compute the weighted FST over 10kb windows.

```{bash}
#! /bin/bash

#SBATCH --job-name=SFS_FST
#SBATCH --partition=gelifes
#SBATCH --mem=35GB
#SBATCH --cpus-per-task=8
#SBATCH --time=0-20:00:00

# cript name: SFS_FST.sh

################
# Define paths #
################
module load HTSlib/1.12-GCC-10.2.0

BAM1_LIST="~/Data/BAM_FILES/BAM_$1.txt"
BAM2_LIST="~/Data/BAM_FILES/BAM_$2.txt"
PATH_RESULTS="~/Results/FST/GFL/$1_$2"
GFL_File1="Wasp_Pina_$1.glf"
GFL_File2="Wasp_Pina_$2.glf"
REF="~/Data/ref_genome/ajConsensus.fasta"
ANGSD="/data/p274105/tool_genomics/angsd"

mkdir -p  $PATH_RESULTS

##########
# ML SFS #
##########
#calculate the 2dsfs prior
$ANGSD/misc/realSFS  $PATH_RESULTS/*"$1".glf.saf.idx $PATH_RESULTS/*"$2".glf.saf.idx -fold 1 -P 8 > $PATH_RESULTS/"$1"_"$2".ml
echo "2DSFS READY"

#prepare the fst for easy window analysis
$ANGSD/misc/realSFS fst index $PATH_RESULTS/*"$1".glf.saf.idx $PATH_RESULTS/*"$2".glf.saf.idx -sfs $PATH_RESULTS/"$1"_"$2".ml -fold 1 -whichFst 1 -fstout $PATH_RESULTS/"$1"_"$2" -P 8
echo "DATA READY FOR SFS CALCULATION"

#get the global estimate
echo -e Unweighted"\t"Weighted > $PATH_RESULTS/"$1"_"$2"_results.fst
$ANGSD/misc/realSFS fst stats $PATH_RESULTS/"$1"_"$2".fst.idx -fold 1 -whichFst 1 >> $PATH_RESULTS/"$1"_"$2"_results.fst
echo "GLOBAL FST COMPUTED"

# FST in sliding windows of 50 kb
$ANGSD/misc/realSFS fst stats2 $PATH_RESULTS/"$1"_"$2".fst.idx -win 10000 -step 10000 -fold 1 -whichFst 1 > $PATH_RESULTS/Sliding_windows_FST_"$1"_"$2".txt
echo "SLIDING WINDOW FST COMPUTED"

rm $PATH_RESULTS/*gz
# GL for everything. Including monomorphic sites. GL estimate for each pop separately. Then used to estimate SFS within pop. Then used to estimate the 2D SFs. Finally 2Dsfs is used to estimate the FST. FST were computed globally + within windows of 50kB. 

```

The -whichFST option was used because it performs better with low sample sizes.

To avoid double comparisons (e.g. K1 with K2 and K2 with K1) this script was run as follow:

"for i in K{1..5}; do for j in K{1..5}; do if [ "$i"  \<  "$j" ]; then sbatch ./SFS_FST $i $j; fi; done; done"

<br/><br/>

## 3) Average FST values + 95% confidence interval (95CI)

The following R command was used to estimate the 95CI. It used the R package Rmisc which assume that the data are t-distributed (number of window-1 degree of freedom)
```{r}
suppressPackageStartupMessages(library(Rmisc))
library(Rmisc)
args = commandArgs(trailingOnly=TRUE)
fst = read.table(args)
CI(unlist(fst), ci=0.95)
```

The previous R command was ran inside the following bash script:

```{bash}
module load R
cd GFL

(echo -e Pop1"\t"Pop2"\t"Global_FST"\t"Mean_FST"\t"Low_CI_FST"\t"High_CI_FST

for i in $(ls| grep -P "K\d_K\d")
	do 
		POP=$(echo $i| perl -pe 's/_/\t/g')
		cd $i
		GLOBAL_FST=$(cat *fst|awk '{print $2}'|awk 'NR>1 {print $0}')
		cat Sliding*|awk '{print $5}' > bizbizfaitlamouche.txt
		MEAN_FST=$(Rscript --vanilla ../../FST_CI.R bizbizfaitlamouche.txt|awk 'NR>1 {print $0}'| cut -f 2 -d " ")
		LOW_CI=$(Rscript --vanilla ../../FST_CI.R bizbizfaitlamouche.txt|awk 'NR>1 {print $0}'| cut -f 3 -d " ")
		HIGH_CI=$(Rscript --vanilla ../../FST_CI.R bizbizfaitlamouche.txt|awk 'NR>1 {print $0}'| cut -f 1 -d " ")  
		echo -e $POP"\t"$GLOBAL_FST"\t"$MEAN_FST"\t"$LOW_CI"\t"$HIGH_CI
		rm bizbizfaitlamouche.txt 
		cd ..
	 done) > ../FST_between_NGSadmix_GROUP.txt



(echo -e Pop1"\t"Pop2"\t"Global_FST"\t"Mean_FST"\t"Low_CI_FST"\t"High_CI_FST

for i in $(ls| grep -v -P "K\d|maf")
        do
          	POP=$(echo $i| perl -pe 's/_/\t/g')
                cd $i
                GLOBAL_FST=$(cat *fst|awk '{print $2}'|awk 'NR>1 {print $0}')
                cat Sliding*|awk '{print $5}' > bizbizfaitlamouche.txt
                MEAN_FST=$(Rscript --vanilla ../../FST_CI.R bizbizfaitlamouche.txt|awk 'NR>1 {print $0}'| cut -f 2 -d " ")
                LOW_CI=$(Rscript --vanilla ../../FST_CI.R bizbizfaitlamouche.txt|awk 'NR>1 {print $0}'| cut -f 3 -d " ")
                HIGH_CI=$(Rscript --vanilla ../../FST_CI.R bizbizfaitlamouche.txt|awk 'NR>1 {print $0}'| cut -f 1 -d " ")
                echo -e $POP"\t"$GLOBAL_FST"\t"$MEAN_FST"\t"$LOW_CI"\t"$HIGH_CI
                rm bizbizfaitlamouche.txt
                cd ..
         done) > ../FST_between_GEO_GROUP.txt
```

### a) FST based on NGSadmix groups
```{r echo=FALSE, message=FALSE, warning=FALSE}
read.table("FST_between_NGSadmix_GROUP.txt",header =T)
```
### b) FST based on Geographic groups
```{r echo=FALSE, message=FALSE, warning=FALSE}
read.table("FST_between_GEO_GROUP.txt",header =T)
```
<br/><br/>

## 4) Visualization with heatmaps (based on the Mean_FST)
### a) FST based on NGSadmix groups
```{r echo=TRUE, message=FALSE, warning=FALSE}

library(viridis)
library(ggtext)
FST = read.table("FST_between_NGSadmix_GROUP.txt",header =T)
FST = FST[,c(1,2,4,5,6)]
FST$To_plot = paste0('**', round(FST$Mean_FST,3),'**<br />[',round(FST$Low_CI_FST,3),'-',round(FST$High_CI_FST,3),']' , sep = "")

p = ggplot(data =FST, aes(x=Pop1, y=Pop2, fill=Mean_FST)) + 
geom_tile(colour = "black") + scale_fill_gradientn(colours = viridis(10000)) +  theme(plot.title = element_text(hjust = 0.5)) + labs(fill = "FST") + xlab("Population 1") + ylab("Population 2") + theme_bw() +   geom_textbox(label = FST$To_plot, box.size = 0, fill = NA, halign = 0.5,  colour = ifelse(FST$Mean_FST < 0.20,'white','black'),size=4) + guides(colour = FALSE) + scale_color_manual(values=c('black','white')) +
theme(axis.text.x = element_text(colour = New_col[1:4], size = 11,face="bold")) +
theme(axis.text.y = element_text(colour = New_col[2:5], size = 11,face="bold")) + 
theme(legend.title = element_text(size=16, face = "bold")) +
theme(axis.text.x = element_text(angle = 45,  hjust = 1)) 

p


```

=> The results are overall consistent with SNP calling results. There are still three differences:

+ **a:** The FST values are lower. 

+ **b:** The FST values between K4 and K5 (between asexual pop.) tend to be larger (relatively speaking compared to the rest)

+ **c:** The FST values between the sexual population Iriomote (K1) vs K2/K3 (Amami/Okinawa) show lower FSt values than comparisons between Iriomote and K4/5 (asexual). The difference was less pronounced in the analysis performed by Pina:
<br/>

<center>

```{r echo=FALSE, include=TRUE, out.width="150%", fig.align = "center"}
knitr::include_graphics(path="Screenshot 2022-07-11 at 17.28.01.png")
```
<center>
<br/><br/>

### b) FST based on Geographic groups
```{r echo=TRUE, fig.height=8, fig.width=15, message=FALSE, warning=FALSE}
library(reshape2)
library(dplyr)

FST = read.table("FST_between_GEO_GROUP.txt",header =T)
FST = FST[,c(1,2,4,5,6)]
FST$To_plot = paste0('**', round(FST$Mean_FST,3),'**<br />[',round(FST$Low_CI_FST,3),'-',round(FST$High_CI_FST,3),']' , sep = "")

for (row in 1:dim(FST)[1]){
  if(FST[row,c(1)]!=FST[row,c(2)]){
   FST = rbind(FST,FST[row,])
   tmp_pop1 = FST[nrow(FST),c(1)]
   tmp_pop2 = FST[nrow(FST),c(2)]
   FST[nrow(FST),c(2)] = tmp_pop1
   FST[nrow(FST),c(1)] = tmp_pop2
   }
}


FST$Pop1 <- factor(FST$Pop1, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Sendai","Tokyo"))
FST$Pop2 <- factor(FST$Pop2, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Sendai","Tokyo"))
FST = arrange(FST, Pop1, Pop2)


list = list()
a = 1
for (i in 1:dim(FST)[1]){
  for (j in 1:dim(FST)[1]){
    if(j > i){
      if(FST[i,1]==FST[j,2] && FST[i,2]==FST[j,1]){
        list[[a]] = FST[i,]
        a = a + 1
      }
    }
  }
}

FST = do.call(rbind,list)


p = ggplot(data =FST, aes(x=Pop1, y=Pop2, fill=Mean_FST)) + 
geom_tile(colour = "black") + scale_fill_gradientn(colours = viridis(10000)) +  theme(plot.title = element_text(hjust = 0.5)) + labs(fill ="FST") + xlab("Population 1") + ylab("Population 2") + theme_bw() +  geom_textbox(label = FST$To_plot, box.size = 0, fill = NA, halign = 0.5, colour = ifelse(FST$Mean_FST < 0.20,'white','black'),size=5) + guides(colour = FALSE) + scale_color_manual(values=c('black','white')) +
theme(axis.text.x = element_text(colour = myColors[1:9], size = 16,face="bold")) +
theme(axis.text.y = element_text(colour = myColors[2:10], size = 16,face="bold")) + 
theme(legend.title = element_text(size=16, face = "bold")) +
theme(axis.text.x = element_text(angle = 45,  hjust = 1)) 

p

```
=> Consistent with all the analyses. 
<br/><br/>

# VIII) Gentic distance matrix (for subset of 54 individuals)
This part allow to subset the matrix of genetic distance calculated by NGSdist 54 individuals. 
As requested by the reviewers, this matrix is going to be correlated with the microbiome composition. 
The aim is to investigate if more genetically distant populations tend to have a more differentiated microbiomes too. 

This first bash block separate the genetic distance matrix estimated by NGSdist from the bootstraped one and prepare (adjust label names) the the list of individuals to be subsetted later. 
```{bash echo=TRUE}
# Extract the genetic distance matrix (from the bootstrapped ones)
cat NJtree/output.dist | awk -v 'RS=\n\n' '1;{exit}'|awk 'NR>2'|perl -pe 's/^\n//g' > distance_matrix.txt

# Remove the sorted in the label names
cat subset*|perl -pe 's/^\n//g'|perl -pe 's/_sorted//g' > ind_for_matrix.txt
```
<br/>
Here is the list of 54 individuals that are going to be extracted:
```{r}
# Import the real ID (with geo names)
ind_to_subset = read.table("ind_for_matrix.txt")
names(ind_to_subset) = "ID"
ind_to_subset
```
<br/>
This R code block extract the genetic distances estimated by NGSDist for a subset 54 individuals (with microbiome data).
```{r message=FALSE, warning=FALSE}
# Import the distance matrix in R
dist_matrix = read.table("distance_matrix.txt")
dist_matrix_names = dist_matrix[,c(1)]
dist_matrix = dist_matrix[,c(-1)]
colnames(dist_matrix) = dist_matrix_names
rownames(dist_matrix) = dist_matrix_names


# Transform the bam ID to the real ID
test1 = read.table("../correspondance_bam_ID_location.txt")
test1 = cbind(paste(test1$V1,test1$V2,sep="_"), test1$V3)
test1 = as.data.frame(test1)
test1 = test1[-c(1),]


for (i in 1:dim(test1)[1]){
  for (j in 1:dim(ind_to_subset)[1]){
    if (test1[i,1]==ind_to_subset[j,]){
      ind_to_subset[j,]=test1[i,2]
    }
  }
}

# Subset the distance matrix for the individuals for which microbiomes data are available
dist_matrix = dist_matrix[rownames(dist_matrix) %in% unlist(ind_to_subset),colnames(dist_matrix )%in% unlist(ind_to_subset)]

# Reverse distance matrix names to use the real geographical ID
for (i in 1:dim(test1)[1]){
  for (j in 1:dim(ind_to_subset)[1]){
    if (colnames(dist_matrix)[j]==test1[i,2]){
      colnames(dist_matrix)[j]=test1[i,1]
      rownames(dist_matrix)[j]=test1[i,1]
    }
  }
}

order_matrix = read.table("ind_for_matrix.txt")
dist_matrix = dist_matrix[unlist(order_matrix),unlist(order_matrix)]
dist_matrix

```

<br/><br/>

# IX) Kinship (for subset of 54 individuals)
<br/>
PCangsd was used to estimate the relatedness between each pair of individuals while taking into account the genetic structure (option -kinship). The method implemented in ANGSD is based on the PCrelate approach (https://www.cell.com/ajhg/fulltext/S0002-9297(15)00493-0).

For the sake of simplicity, I first ran PCrelate for the 95 available individuals. Then afterward I sub-sampled the resulting relatedness matrix to have the relatedness coefficient just for the 54 individuals for which microbiome data are available.

PCrelate was ran on the LD pruned data set using the below script:
```{bash}
#! /bin/bash

#SBATCH --job-name=PCrel
#SBATCH --partition=short
#SBATCH --mem=1GB
#SBATCH --cpus-per-task=1
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=h.y.ben.chehida@rug.nl
#SBATCH --time=00:30:00


# PCAngsd needs and accepts are genotype likelihoods in Beagle format.
# We generated it already for the NGSadmix analysis. So we can reuse it here.

################
# Define paths #
################
GFL="/data/p274105/Wasp_Pina/Results/Wasp_Pina_subset.glf.beagle.gz"
OUTPUT="/data/p274105/Wasp_Pina/Results/PCrelate"

module load PCAngsd/0.98-foss-2018a-Python-3.6.4
module load R/4.0.0-foss-2020a   

#################################
# Estimate pairwise relatedness #
#################################
pcangsd.py \
-beagle $GFL  \
-o $OUTPUT/inbreeding_1/Res_1 \
-threads 1 \
-inbreed 1 \
-inbreed_iter 1000 \
-kinship \
-inbreed_tole 1e-6
```

<br/> 

The R code below:

+ **1:** Extract the pairwise relatedness coefficients for the 54 individuals of interests.

+ **2:** Set the negative relatedness values to 0.

+ **3:** Visually summarize the results using a default heat map (I can modify it so it looks like the FST one if needed).


```{r fig.height=12, fig.width=12}
#!/usr/bin/env Rscript
library(RcppCNPy) # library to read numpy files
options(scipen=999) # use full number instead of scientific notation
library(pheatmap) 
library(dplyr)


kinship <- npyLoad(filename="PCrelate/inbreeding_1/Res_1.kinship.npy") #Import kinship matrix in numpy format
kinship = as.data.frame(kinship) # transform the matrix to a data frame
individuals <- read.table("../bam_location_correspondance.txt") # individuals name and pop of each individual
colnames(kinship) =unlist(individuals[,c(1)]) # add individuals name to each column 
rownames(kinship) =unlist(individuals[,c(1)]) # add inviduals name to each row
kinship = kinship[rownames(kinship) %in% unlist(ind_to_subset),colnames(kinship )%in% unlist(ind_to_subset)]
kinship[kinship < 0] = 0 # kinship values inferior to 0 are not biologically meaningful. They are set to 0.
#kinship[kinship > 0.25] = 1
individuals = individuals[individuals$V1  %in% unlist(ind_to_subset),]


geography = as.data.frame(individuals[,2],stringsAsFactors=TRUE) # store information on geography in a separate object
rownames(geography) = individuals[,1]
colnames(geography) = "Geography"
geography$Geography <- factor(geography$Geography, levels= c("Iriomote","Okinawa","Amami","Kagoshima","Fukuoka","Kobe","Kyoto","Kanazawa","Tokyo","Sendai"))
geography$Geography = sort(geography$Geography)

# Block to assign desired colors to the desired populations. 
mycolors = cbind(unique(geography),myColors) 
geo = geography

geography$colors = "empty"
for (i in 1:dim(geography)[1]){
  for (j in 1:dim(mycolors)[1]){
    if(geography[i,1]==mycolors[j,1])
      geography[i,2] = mycolors[j,2]
  }
}

# Prepare the colors in the format desired by pheatmap
mat_colors <- list(Geography = myColors)
names(mat_colors$Geography) <- unique(unlist(geo))

# Pheatmap plot
pheatmap(t(kinship),na_col="white",annotation_col=geo,annotation_row=geo,cluster_rows=F,cluster_cols=F,annotation_colors = mat_colors)

```

```{r}
max(unlist(kinship[kinship < 0.25]))
```
The maximal relatedness values is equal to **0.0407**

<br/> 


If we look at the distribution of the estimated relatedness values
```{r}
plot(density(unlist(kinship[kinship < 0.25])))

```
Basically none of the individuals are related. 

This result strongly contrasts with the results obtained using SNPrelate:
```{r echo=FALSE,  include=TRUE, out.width="75%"}
knitr::include_graphics(path="Screenshot 2022-07-20 at 11.56.25.png")
```

With SNPrelate many of the values are very high. 

<br/> 

The following R code:

+ **1:** Label individuals according to their geographic ID (not the one in the bam files starting with Sx)

+ **2:** Create and display the matrix values that I will share with Pina (for correlation with the microbiome)


```{r}
# Reverse distance matrix names to use the real geographical ID
for (i in 1:dim(test1)[1]){
  for (j in 1:dim(ind_to_subset)[1]){
    if (colnames(kinship)[j]==test1[i,2]){
      colnames(kinship)[j]=test1[i,1]
      rownames(kinship)[j]=test1[i,1]
    }
  }
}

# order matrix geographically
order_matrix = read.table("ind_for_matrix.txt")
kinship = kinship[unlist(order_matrix),unlist(order_matrix)]

kinship = round(kinship,digits=4)

# Create a text file with the relatedness matrix
write.table(kinship, file="matrix_relatedness_54_ind.txt",sep="\t",quote=FALSE,row.names = TRUE,
            col.names = TRUE)

# Display matrix
kinship
```

Write final matrix of relatedness (handle missing tab in the beginning)
```{bash}
cat matrix_relatedness_54_ind.txt | head -n 1 | perl -pe 's/I_4/\tI_4/g' > matrix_relatedness.txt
cat matrix_relatedness_54_ind.txt| awk 'NR>1' >> matrix_relatedness.txt
rm matrix_relatedness_54_ind.txt
mv matrix_relatedness.txt matrix_relatedness_54_ind.txt
```

