Introduction

Rapeseed (Brassica napus L.) is an important source of edible oil and protein-rich livestock feed in the world. B. napus (AACC) was ancestrally originated from an interspecific hybridization between two diploid progenitors, B. rapa (AA) (n = 10) and B. oleracea (CC) (n = 9), less than 7500 years ago. In our previous study, we resequenced a world-wide collection of 991 B. napus gerplasm accessions, including 658 winter types, 145 semi-winter types and 188 spring types, from 39 countries (Wu et al., 2019).

 

The geographic distribution of rapeseed accessions

In genetics, a genome-wide association study (GWAS), also known as whole genome association study (WGAS), is an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. GWAS typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits like major agronomic traits.

In order to make better use of this huge B. napus gerplasm accessions, we develop this interactive application (BnaGWAS) in R with Shiny. This application can conduct GWAS, visualization of GWAS results (Manhattan plot and QQ plot), extraction of significant genes and annotation of genes.

Data input

phenotype data (.txt)

Noted: Your Samples Uploaded MUST Be The 300 Core Collection Samples Used Here ! So If Some Samples Are Not In Your LIST, You Need Add Them In Your List, And Set The Value NA. If Some Samples In Your List Are Not In The 300 Core Collection Samples Here, JUST REMOVE THEM!

I highly recommended first download the example of the expected input phenotype dataset below, and then replace the phenotype values with you own data.

You just need upload your phenotype data to run GWAS. Here we just use the 300 core collection gerplasm which represent the most of genetic resources of 1000 B. napus gerplasm accessions. an example of the expected input data format is present as below:

R4157 0.859791123
R4158 0.87369142
R4163 0.842593709
R4168 0.884782609
R4171 NA
R4176 0.885619807
R4177 0.885884455
R4179 0.879374612
R4180 0.878567797
R4182 0.868825911

Where, column one correspond to samples, column two correspond to phenotype values.

An example of the expected input phenotype dataset can be accessible here.

Other parameters

Next you need enter your trait name (recommended) (default: Bna_trait). Now just support the EMMAX model. After all the prepared works are ready, then clink Run Analysis to start GWAS.

Visualization

For the Visualization section in this App, it is aiming to visualize the Manhattan plot and QQ plot. You can choose the alternate colors for alternate chromosomes and p-value threshold (default: p-value=5).

Extraction

The extraction of significant genes is based on the significant p-value of SNPs. So here you need choose the p-value threshold and the distance up/down-stream of SNPs (recommended) (default: 75kb).

Annotation

This section is designed for gene annotation based on different databases (eggNOG, GO, KEGG, NR, etc.).

Analysis Workflow

Step 1: Data Input Upload your phenotype data via RUN GWAS section, a distribution plot of your input data will be presented.

Step 2: visualization generate Manhattan plot and QQ plot.

Step 3: Extraction obtain the significant genes.

Step 4: Annotation annotate significant genes based on different databases.

Frequently Asked Questions and Answers

Q1. How to organize my own phenotype data?

  A: Now the input data MUST be .txt format. I highly recommended first download the example of the expected input phenotype dataset from here, and then replace the phenotype values with you own data.

Q2. Can I choose other models (GLM, MLM, CMLM, etc.) to run GWAS?

  A: NO, this app can only run the EMMAX model. It is easy to implement other models for GWAS, but other models need more time to run.

Q3. Can I run the GWAS of quality trait?

  A: Yes! quantitative trait and quality trait are fully support.

Q4. How to choose the distance up/down-stream for gene extraction?

  A: This dependents on your GWAS results and Linkage Disequilibria of your sample. Here we recommend 75kb~100kb.

Feedback

This App is developed and maintained by Tao Yan at the JiangLX Lab, Institute of Crop science, College of Agriculture and Biotechnology (CAB), Zhejiang University.

If you have any questions, comments, or suggestions, feel free to contact the developer at tyan<at>zju.edu.cn.

Step1: Upload Your Phenotype File




Step2: Select Reference Genome, GWAS Model and Enter Trait Name

Distribution Of Your Phenotype

Loading...

If you use EMMAX to publish research, please cite:

Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, Sabatti C, Eskin E. (2010) Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42:348-54.

Customized the visualization of Manhattan Plot and QQ Plot

Extract genes based on significant SNPs

Genes extracted based on significant SNPs

Gene annotation based on different databases

If you are sure your previous analysis are right, then clink the button: Run Annotation


利用EMMAX进行GWAS分析

文件准备

利用EMMAX进行GWAS分析需要以下文件

基因型文件

基因型文件直接使用VCF进行过滤后得到,具体如下(我这里原始的VCF文件是test.vcf),过滤前先将染色体名称全部换为数值型(不然后续GWAS分析会报错的),同时添加SNP ID,这个比较简单就不讲了! 另外如果测序深度较低,可以基因型填补一下,按研究需要吧,这里就不进行了。

  • 按基因型等位频率(0.05)以及缺失率(0.1)进行过滤
plink --vcf test.vcf --maf 0.05 --geno 0.1 --recode vcf-iid --out test.maf0.05.int0.9

生成test.maf0.05.int0.9.vcf文件

  • 依据LD对标记进行筛选(这一步SNP必须有ID才行,GATK等软件生成的VCF文件是没有ID的)
plink --vcf test.maf0.05.int0.9.vcf --indep-pairwise 50 10 0.2 --out test.maf0.05.int0.9

产生test.maf0.05.int0.9.prune.intest.maf0.05.int0.9.prune.out两个文件,in表示入选的标记,out表示被去除的标记,这两个文件都是只含有SNP ID,根据ID进行筛选

  • 提取筛选结果
plink --vcf test.maf0.05.int0.9.vcf --make-bed --extract test.maf0.05.int0.9.prune.in --out test.maf0.05.int0.9.prune.in
  • 将筛选后的数据转换为vcf格式
plink --bfile test.maf0.05.int0.9.prune.in --recode vcf-iid --out test.maf0.05.int0.9.prune.in

得到test.maf0.05.int0.9.prune.in.vcf基因型文件

  • 将筛选后的文件转换为admixture数据格式用于群体结构分析
plink -bfile test.maf0.05.int0.9.prune.in --recode 12 --out test.maf0.05.int0.9.prune.in

得到test.maf0.05.int0.9.prune.in.ped用于admixture分析

  • 将基因型文件转换为EMMAX格式
plink --vcf test.maf0.05.int0.9.vcf --recode 12 transpose --output-missing-genotype 0 --out test.maf0.05.int0.9 --autosome-num 90

得到tfam文件

  • 群体亲缘关系分析
emmax-kin test.maf0.05.int0.9 -v -h -d 10 -o test.maf0.05.int0.9.BN.kinf

### 如果你下载的emmax软件是inter这个版本的,这里需要注意,命令有点不同
emmax-kin-inter64 test.maf0.05.int0.9 -v -d 10 -o test.maf0.05.int0.9.BN.kinf

群体结构文件

首先admixture群体结构分析,可以添加参数j,多线程

for i in {1..15}
do
admixture --cv test.maf0.05.int0.9.prune.in.ped ${i} -j48 >> log.txt
done

查看最佳分群

$ grep CV log.txt
CV error (K=1): 0.43802
CV error (K=2): 0.41492
CV error (K=3): 0.40182
CV error (K=4): 0.39723
CV error (K=5): 0.39281
CV error (K=6): 0.38899
CV error (K=7): 0.38640
CV error (K=8): 0.38434
CV error (K=9): 0.38301
CV error (K=10): 0.38156
CV error (K=11): 0.38050
CV error (K=12): 0.38006
CV error (K=13): 0.37869
CV error (K=14): 0.37795
CV error (K=15): 0.37830

K=14时CV值最小,所以最佳分群为14 K=14时候的Q矩阵用于后续分析。admixture产生的Q矩阵格式如下(前6行)

0.000010 0.864016 0.039019 0.000010 0.000010 0.000010 0.000010 0.000010 0.000010 0.096855 0.000010 0.000010 0.000010 0.000010
0.000010 0.135428 0.000010 0.039195 0.000010 0.192606 0.195391 0.062418 0.000010 0.124034 0.214463 0.000010 0.028383 0.008032
0.000010 0.613532 0.000010 0.043222 0.000010 0.000010 0.272258 0.000010 0.000010 0.000010 0.058748 0.000010 0.012151 0.000010
0.000010 0.419498 0.417594 0.000010 0.000010 0.000010 0.000010 0.000010 0.000010 0.000010 0.000010 0.090515 0.062396 0.009908
0.012456 0.045110 0.000010 0.000010 0.000010 0.021304 0.202956 0.049112 0.001099 0.455700 0.011129 0.093843 0.107251 0.000010
0.000010 0.071372 0.004548 0.000010 0.004779 0.000010 0.326421 0.009612 0.003101 0.211292 0.350671 0.000010 0.018153 0.000010

因为文件相对较小,手动添加表头信息,以及删除最后一列,保存14.Q用于后续分析,之后如下:

<Covariate>
<Trait> Q1      Q2      Q3      Q4      Q5      Q6      Q7      Q8      Q9      Q10     Q11     Q12     Q13
R4155   0.00001 0.864016        0.039019        0.00001 0.00001 0.00001 0.00001 0.00001 0.00001 0.096855        0.00001 0.00001 0.00001
R4156   0.00001 0.135428        0.00001 0.039195        0.00001 0.192606        0.195391        0.062418        0.00001 0.124034        0.214463        0.00001 0.028383
R4157   0.00001 0.613532        0.00001 0.043222        0.00001 0.00001 0.272258        0.00001 0.00001 0.00001 0.058748        0.00001 0.012151
R4158   0.00001 0.419498        0.417594        0.00001 0.00001 0.00001 0.00001 0.00001 0.00001 0.00001 0.00001 0.090515        0.062396
  • 表型数据

EMMAX每次只能运行一个性状,格式与tfam格式类似,所以需要进行拆分转换,拆分前,需要将缺失值-999替换为NA。 最开始我的表型数据如下:

<Trait> day
R4155   184
R4156   181
R4157   181
R4158   182
R4159   170
R4160   179

利用R进行转换

tfam <- read.table("test.maf0.05.int0.9.tfam", header = F, stringsAsFactors = F)
tr <- read.table("test.trait.txt", header = T, check.names = F, stringsAsFactors = F)
tr <- tr[match(tfam$V1, tr$`<Trait>`),] ###把基因型文件样本顺序和表型文件样本顺序调整为一致
tr[tr == -999] <- NA
tre <- cbind(tr[,1], tr)
write.table(tre, file = "test_trait_emmax.txt", col.names = F, row.names = F, sep = "\t", quote = F)

转换后结果如下:

R4155   R4155   184
R4156   R4156   181
R4157   R4157   181
R4158   R4158   182
R4159   R4159   170
  • 协变量文件

这里以群体结构作为协变量

tfam <- read.table("test.maf0.05.int0.9.tfam", header = F, stringsAsFactors = F)
admix <- read.table("14.Q", header = T, check.names = F, stringsAsFactors = F, skip = 1)
admix <- admix[match(tfam$V1, admix$`<Trait>`), ]
admix <- cbind(admix[,1], admix[,1], rep(1, nrow(admix)), admix[,-1]) 
write.table(admix, file = "test.emmax.cov.txt", col.names = F, row.names = F, sep = "\t", quote = F)

生成的协变量文件格式如下:

R4155   R4155   1       1e-05   0.864016        0.039019        1e-05   1e-05   1e-05   1e-05   1e-05   1e-05   0.096855        1e-05   1e-05   1e-05
R4156   R4156   1       1e-05   0.135428        1e-05   0.039195        1e-05   0.192606        0.195391        0.062418        1e-05   0.124034        0.214463        1e-05   0.028383
R4157   R4157   1       1e-05   0.613532        1e-05   0.043222        1e-05   1e-05   0.272258        1e-05   1e-05   1e-05   0.058748        1e-05   0.012151
R4158   R4158   1       1e-05   0.419498        0.417594        1e-05   1e-05   1e-05   1e-05   1e-05   1e-05   1e-05   1e-05   0.090515        0.062396
R4159   R4159   1       0.012456        0.04511 1e-05   1e-05   1e-05   0.021304        0.202956        0.049112        0.001099        0.4557  0.011129        0.093843 0.10725

GWAS运行

所有准备文件都有之后,就可以进行GWAS分析了

emmax -v -d 10 -t test.maf0.05.int0.9 -o test_emmax_cov -p test_trait_emmax.txt -k test.maf0.05.int0.9.BN.kinf -c test.cov.emmax.txt

最后生成的test_emmax_cov.ps就是我们需要的

可视化

这里用qqman进行manhattan以及QQ图的绘制

#gwas_visualization.R
setwd("/database/pop-analysis-NY7/result/pop-structure/gwas")
if(!require(qqman)){
    install.packages("qqman")
    library(qqman)
}
if(!require(tidyverse)){
    install.packages("tidyverse")
    library(tidyverse)
}
gwas <- read.table("test_emmax_cov.ps", header = FALSE, stringsAsFactors = FALSE)
colnames(gwas) <- c("SNP","SE","P")
splite_chr <- function(data,c){
  c(str_split(data[c], pattern = "_")[[1]][1])
}
splite_pos <- function(data,c){
  c(str_split(data[c], pattern = "_")[[1]][2])
}
gwas$CHR <- apply(gwas,1,splite_chr,c="SNP")
gwas$BP <- apply(gwas,1,splite_pos,c="SNP")
gwas <- gwas%>%select(SNP,CHR,BP,P)
write.table(gwas,"gwas_emmax_test_for_vis.ps", col.names = TRUE, row.names = FALSE, quote = FALSE)
pdf("test_emmax_manhattan.pdf",width = 12,height = 6)
manhattan(gwas)
dev.off()
pdf("test_emmax_qq.pdf", width = 8, height = 6)
qq(gwas$P)
dev.off()
png("test_emmax_manhattan.png",width = 960,height = 600,type = "cairo")
manhattan(gwas)
dev.off()
png("test_emmax_qq.png", width = 800, height = 800,type = "cairo")
qq(gwas$P)
dev.off()

将上述脚本保存为gwas_visualization.R,之后在linux中运行

Rscript gwas_visualization.R

就可以得到我们需要的图

后续分析

后续还有很多分析,比如,显著位点的提取,关联基因的提取,关联基因的注释等等。

About this app

This app (BnaGWAS) was built by Tao Yan. It’s based on the genomic dataset reported by previous research (Wu et al., 2019).

Tech used in this app

Browse the full source code at https://github.com/YTLogos/gwas.

Packages used in this app