Data mining:ATAC-seq,RNA-seq

9 minute read

Published:

背景

染色质开放性 Chromatin Accessibility

人的DNA链全部展开大约有2m,需要折叠为染色质结构才可以存储到放到细胞核中。染色质的基本结构单位是核小体,核小体再折叠能形成高度压缩的染色质结构。这个过程像我们将文件压缩为zip或者rar的压缩包,只要在使用的时候才会解压出来,平时可以减少它的占用空间。

Fig 1: Chromatin Accessibility

高度折叠的染色质结构在复制和转录时需要暴露出DNA序列,这段暴露的区域就是染色质开放区域,这个区域可以供转录因子和其他调控元件结合,所以它与转录调控是密切相关的。

因此这种致密的核小体结构被破坏后,启动子、增强子等顺式调控元件和反式作用因子可以接近的特性,叫染色质开放性(Chromatin Accessibility)。

为了研究染色质的开放性,目前有MNase-seq,Dnase-seq,ATAC-seq等,但是目前最常用的是2013年由斯坦福大学开发的ATAC-seq。与传统的MNase-seq以及DNase-seq相比,其具有可重复性强,实验步骤简单,需要的实验样本量少等优点,因而被广泛应用1

image-20221219190711398

Fig 2: Methods of Researching Chromatin Accessibility a nd ATAC-seq principle

原理

利用转座酶Tn5会携带特定的已知序列,并且可以结合开放的染色质。Tn5酶对染色质开放区进行打断,在打断的同时加上测序接头,接着进行DNA提取,PCR扩增构建文库。经过测序分析,就可以推断染色质可行性、转录因子结合位点、组蛋白修饰区域和核小体位置。

image-20221219193308860

Fig 3: The Process of Tn5

研究内容

肠上皮化生简称肠化,是指正常的胃黏膜上皮被肠型上皮所取代。

正常情况下,我们的器官各司其职,胃表面生长的是具有分泌胃酸功能的胃黏膜上皮细胞,肠道表面生长的是具有分泌和吸收功能的肠黏膜上皮细胞。但当胃黏膜细胞受到比较严重的损伤后,胃肠黏膜上皮结构出现了一定改变,越长越像邻居家肠黏膜的孩子。看上去就像肠黏膜长错了地方,本该长在肠道上长的结构却出现在了胃黏膜上,就像一片草地长出了树木,树木就显得很突出。

目前的假设是,胃黏膜腺体的颈部干细胞具有多方面分泌的潜能,在正常时它可以分化成各种胃黏膜的成熟上皮细胞[9]干细胞不正常工作时肠化进程会加速,从肠化生过渡到胃癌,而肠化属于胃癌前病变的一种

胃黏膜上皮细胞癌变并非一朝一夕的事情,不是由正常细胞一跃成为癌细胞,而是一个慢性渐进的过程,在发展成恶性肿瘤之前,经历多年持续的癌前变化。若能及早识别和及早干预,也是一种防止胃癌的有效途径

因此从干细胞水平上能够发现促使正常干细胞分化为肠化细胞的根本原因,对于预防胃癌,以及使肠化逆转显得尤为重要。

实验设计:

针对10个病人,分别采集胃,肠化组织,分为两组(stemness + / - )进行培养。

其中阴性对照:正常胃组织,阳性对照:正常十二指肠组织, stemness: 位置细胞干性的条件。+ 维持干性,-不维持,IM: + 。 - 。 胃窦:A, 胃体:C, 胃角:AC

分析

目前已经开源了很多ATAC-seq原始data的预处理与计算,其基本流程为:

QC->Alignment->Remove low quality-> Call Peak

针对Call Peak 的结果,可以计算不同组间差异的Peak,或者Motif 富集与转录因子足迹分析,更进一步的可以联合RNA-seq。

image-20221219191431029

Fig 4: Roadmap of a typical ATAC-seq analysis.

Pipeline:

  1. ENCODE ATAC-seq pipeline

  2. ATAC-seq Guidelines form Harvard

This pipeline is designed for automated end-to-end quality control and processing of ATAC-seq and DNase-seq data.

标准

介绍前期质控指标,避免样本问题对后期实验结果的影响,造成错误或返工

比对率:

正常是超过95%,最低不能低于80%。

# 可以抽取部分样本在nt数据库中进行比对,看map到那些物种中,是否有部分细菌污染
zcat ../data/B63_L4_Q803601.R1.fastq.gz | head -n 1000 >B63_1
zcat ../data/B63_L4_Q803601.R2.fastq.gz | head -n 1000 >B63_2
 
awk '{if(NR%4 == 1){print ">" substr($0, 2)}}{if(NR%4 == 2){print}}' B63_1 > B63_1.fasta
blastn -task blastn -query B63_1.fasta -db /home/nt -num_threads 6 -out unpaired_blastn.aln
# 本地构建db花费时间较多,可以线上。
#B63_atac:73.4
#B60_atac:77.3
#细菌污染

峰值区域的读数比例(FRiP score)

FRiP score应大于0.3,最低不能低于0.2。

所有映射的读数中,属于被称为峰值区域的部分,即显著富集的峰值中的可用读数除以所有可用读数。一般来说,FRiP得分与区域的数量呈正相关.(Landt et al, Genome Research Sept. 2012, 22(9): 1813–1831)

TSS 富集:

TSS富集计算是一种信噪比计算。收集一组参考TSSs周围的读数,形成以TSSs为中心、向任一方向延伸2000bp(共计4000bp)的读数总分布。然后,该分布被归一化,即在分布的每个末端侧翼的100bps内取平均读数深度(总共200bp的平均数据),并计算每个位置相对于该平均读数深度的倍数变化。这意味着侧翼应该从1开始,如果在转录起始位点(基因组的高度开放区域)有高的读数信号,那么信号应该增加,直到中间的一个峰值。我们把这个归一化后的分布中心的信号值作为我们的TSS富集度量。用于评估ATAC-seq。

Fig 5: Transcription Start Site (TSS) Enrichment

Fig 6: Transcription Start Site (TSS) Enrichment Standard Value

文库复杂度测量:

理想状态值是: NRF>0.9, PBC1>0.9, and PBC2>3.

Fig 7: Non-Redundant Fraction etc. Standard Value

Non-Redundant Fraction (NRF) – Number of distinct uniquely mapping reads (i.e. after removing duplicates) / Total number of reads.

PCR Bottlenecking Coefficient 1 (PBC1)

PCR Bottlenecking Coefficient 2 (PBC2)

import os
import pandas as pd
import json
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

path= '/media/wvdon/data/wfy/atac/work/'
dirs = os.listdir(path)
aaa = []
for d in dirs:
    if not str(d).endswith("e"):
        json_path = f'/media/wvdon/data/wfy/atac/work/{d}/qc/qc.json'
        #print(d)
        data = ''
        with open(json_path, 'r') as f:
            data = json.load(f)
            #print(data)
        tss = data["align_enrich"]['tss_enrich']['rep1']['tss_enrich']
        map_read_prc =data["align"]['samstat']['rep1']['pct_mapped_reads']
        NRF = data["lib_complexity"]['lib_complexity']['rep1']['NRF']
        PBC1 = data["lib_complexity"]['lib_complexity']['rep1']['PBC1']
        aaa.append([d,tss,map_read_prc,NRF,PBC1])
an_data = pd.DataFrame(aaa,columns=['B',"tss","map_read_prc","NRF","PBC1"])

sns.kdeplot(an_data['PBC1'],cut=0,cumulative=True,shade=True,color="b")
#sns.kdeplot(an_data['map_read_prc'],cut=0,cumulative=True,shade=True,color="r")
plt.legend(title="PBC1")
plt.show()
an_data['PBC1'].hist()

差异Peak分析

差异peak是分析的第一步,也是基础。根据实验的设计,可以比较两个组之间差异的Peak.

以往的几篇文章都推荐使用Diffbind(Differential binding analysis of ChIP-seq peaksets)

目前没有专门为ATAC设计的差异peak 分析工具,不过他们都是计算该区域的counts数据,归一化,对比两个组之间的差异。

另外HOMER, DBChIP,也能实现同样的需求。

利用Diffbind进行差异Peak分析(PCA,MA,heatmap,Volcano,differ Peak):

library(DiffBind)
csv_path = "/media/wvdon/sdata/atac-seq/after/example5.csv"
dbObj <- dba(sampleSheet=csv_path)

plot(dbObj)

dbcount <- dba.count(DBA = dbObj,bUseSummarizeOverlaps=TRUE,bParallel = FALSE)
save.image("/media/wvdon/sdata/atac-seq/after/atacafter.RData")
load("/media/wvdon/sdata/atac-seq/after/atacafter.RData")


dba_counstrast = dba.contrast(dbcount,categories =
                                DBA_TREATMENT,minMembers = 2)

bdaaly = dba.analyze(dba_counstrast,method = DBA_DESEQ2)


differ_peak_2 = dba.report(bdaaly,bCounts = T)
head(diff_peaks2)
diff_peaks_3= subset(differ_peak_2$Fold>=1 | differ_peak_2$Fold<=1)
#dba.show(dba_counstrast,bContrasts = T)
pma = dba.plotMA(bdaaly,contrast = 1)
#ggsave(file="/media/wvdon/sdata/atac-seq/before/atacMA.svg", plot=pma, width=4, height=4)    
dba.plotVolcano(bdaaly,contrast = 1)

length(differ_peak_2)

#hmap=colorRampPalette(c("blue","white","red"))(n=13)
#readscores=dba.plotHeatmap(bdaaly,contrast = 1,#ColAttributes=c(DBA_TREATMENT,DBA_GROUP),
 #                          main = "DESeq2 Differentially Bound Sites",
  #                         correlations = FALSE,scale='row',colScheme = hmap)

dim(readscores@elementMetadata)

library(dplyr)
diff_peaks2 <- bind_cols(as_tibble(granges(differ_peak_2)), as_tibble(mcols(differ_peak_2)))
library(pheatmap)
Groups=c(rep("IMP",8),rep("IMN",8))
heatmap_peak = differ_peak_2@elementMetadata[7:22]
dim(heatmap_peak)
#write.csv(diff_peaks2, "/media/wvdon/sdata/atac-seq/before/12_18_peak_FDR005.csv")

write.csv(heatmap_peak, "/media/wvdon/sdata/atac-seq/after/12_18heatmap_peak.csv")
data<-read.csv("/media/wvdon/sdata/atac-seq/after/12_18heatmap_peak.csv",header = T,row.names = 1)
#heatmap_peak.columns=Groups
#colnames(heatmap_peak)

annotation_c<-data.frame(Groups)
rownames(annotation_c)<-colnames(data)
colnames(data)
labels_col=c('37AC22','30A51','33A45','46A46','47A47','49AC44','13A06','54A48','37C20','30C04','33C21','46C23','47C07','49C41','13C60','54C62')

p<-pheatmap(data, cluster_rows = F,      #行聚类,列不聚类
            cluster_cols = F,
            show_rownames = F,       #不显示行名
            clustering_distance_rows = "correlation",
            show_colnames = T,      #显示列明 angle_row="15",行名旋转15度,列明相似
            
            annotation_col = annotation_c,  #对列进行注释即对列进行分组
            #na_col = "white",
            scale = "row",   #将数据按行进行标准化
            
            #设置格子大小 cellheigt=""设置格子高
            
            #设置格子高
            labels_col= labels_col,
            angle_col = 90,
            border=F
            ,color = colorRampPalette(colors= c("blue","white","red"))(10) 
            #,color = colorRampPalette(c("#FFFF00","#FF0000"))(100)
)   
p
library("ggplot2")
#some sample data
#BiocManager::install('svglite')
#This actually save the plot in a image
ggsave(file="/media/wvdon/sdata/atac-seq/after/12_18heatmap.svg", plot=p, width=8, height=8)

Peak 注释

import argparse
import math
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import sys
sys.path.append(r'/home/wvdon/BIO_ATAC/')
from common.pyShell import  runShell  
import os

def shellAccept():
    '''
    预定义命令行参数,接收并存储
    必须参数:None
    可选参数:
    -u / --URL
    -t / --threads
    -v / --version
    @return:返回获取到的命令行参数args,以数据字典格式
    '''
    try:    # 异常处理
        parser = argparse.ArgumentParser(description="peak 基因注释")
        #@todo 临时写成测试文件。
        parser.add_argument("-u", "--csv",required=False, type=str, help="peak csv path",default='/media/wvdon/sdata/test/12_19_peak_FDR005_remove4.csv')
        parser.add_argument("-v", "--version", type=str, help="工具版本号:V1.0")
        parser.add_argument("-output","--output", type=str,default='./annotate',help='输出路径')
        parser.add_argument("-annotatePeaks",default='/usr/local/share/bio/homer/bin/annotatePeaks.pl',help='-annotatePeaks.pl 执行路径')
        args = parser.parse_args()  # 获取参数字典
        return args
    except Exception as e:
        print(e)
def peakAnnotion(csv_path,work_path):
    bed_path=os.path.join(work_path,'outputPeak.bed')
    df = pd.read_csv(csv_path).iloc[:,1:11]
    
    df.columns = ['Chromosome', 'Start', 'End','width','Strand','C','N','P','Fold','pv']
    df[['Chromosome', 'Start', 'End','Strand']].to_csv(bed_path,header=None,sep='\t',index=0)
    
    shell = f'{args.annotatePeaks}  {bed_path} hg38 > {work_path}/outputPeakAnnotate.txt'
    print(f'exectue shell {shell}')
    status_code = runShell(shell,timeout = 120)
    if status_code==0:
        print(f'success annotate,export Peak Annotate file :{work_path}/outputPeakAnnotate.txt')
    else:
       print(f"Exectue '{shell}' Failed") 
    gene = pd.read_table(f'{work_path}/outputPeakAnnotate.txt',sep='\t')
    ids_list = []
    for k in gene.iloc[:,0]:
        if len(str(k))>1:
            ids_list.append(int(str(k)[2:])-1)
        else:
            ids_list.append(0)
    gene['ids'] = ids_list
    gene.sort_values('ids').to_csv(f'{work_path}/peak_outputPeakAnnotate_sorted.csv')
    
    atac_gene = gene.sort_values('ids')
    atac_gene['Fold']=df['Fold']

    output_gene_path = f'{work_path}/peak_outputPeakAnnotate_sorted_conact.csv'
    atac_gene.to_csv(output_gene_path)
    list_annotion = []
    for an in atac_gene['Annotation']:
        list_annotion.append(str(an).split(' (')[0])
    dict={}
    for key in list_annotion:
        dict[key]=dict.get(key,0)+1
    peakAnnotionPie(work_path,dict)
    peakUpDownPieAndBar(work_path,output_gene_path)
    return output_gene_path
def peakUpDownPieAndBar(work_path,data_path):
    data = pd.read_csv(data_path)
    fig = plt.figure()
    x = np.arange(0,math.pi*2,0.05)
    up=data[data['Fold']>0]['Fold']
    down = data[data['Fold']<0]['Fold']

    if len(up) > len(down):
        up_bins = 200
        down_bins = int(2000/len(up)*len(down))
        exp = (0,0.5)
        sits = 221
    else:
        down_bins = 200
        up_bins = int(2000/len(down)*len(up))
        ex = (0.5,0)
        sits = 222
    ax1 = fig.add_subplot(111)
    ax1.hist(down,bins=down_bins,color='orange')
    ax1.hist(up,bins=up_bins,color='red')
    ax2 = fig.add_subplot(sits,facecolor='r')
    ax2.pie([len(up),len(down)],shadow=True,colors=['orange','red'],explode=exp,labels=['colsed','open'],autopct='%1.1f%%')
    ax2.set_title(f'Total:{len(data)}')
    plt.savefig(f'{work_path}/percent_atac_pie.svg',dpi=300)
    print('plot annotion pie_bar done!')
    
def peakAnnotionPie(work_path,dict):
    expodes = (0,0,0.1,0,0,0,0,0.1)
    colors = ['red','orange','yellow','green','purple','blue','black','brown']
    plt.pie(dict.values(),explode=expodes,labels=dict.keys(),shadow=True,colors=colors,autopct='%1.1f%%')
    ## 用于显示为一个长宽相等的饼图
    plt.axis('equal')
    #保存并显示
    plt.savefig(f'{work_path}/pie_annotion.svg',dpi=300)
    print('plot annotion pie done!')


if __name__ == '__main__':
    root_path = os.getcwd()
    
    args = shellAccept()
    work_path = args.output
    csv_path = args.csv
    if not os.path.exists(work_path):
        os.makedirs(work_path)
    output_gene_path = peakAnnotion(csv_path,work_path)
import subprocess
class pyShell():
    def __init__(self) -> None:
        pass
    '''
    return 0 : Success , else: Fail
    '''
def runShell(command,timeout=5):
    ret = subprocess.run(command,shell=True,stdout=subprocess.PIPE,stderr=subprocess.PIPE,encoding="utf-8",timeout=timeout)
    return ret.returncode 

Motif

peak注释虽然提供了功能解释,但并没有直接解释底层机制。开放的染色质可以通过转录因子影响转录,转录因子通过识别和结合 DNA 上的特定序列(TFBS:TF 结合位点)来促进转录。而事实上转录因子通过与组蛋白或非组蛋白的竞争以及与辅因子的合作来调节转录。

有两种类型的基序或基于 TF 的分析方法(研究TF调控):

  • 基序频率或活动的基于序列的预测
  • TF 占用的足迹。

JASPAR是现在用的最多的一个motif 数据库,事实上存的就是一些转录因子对应的位置权重矩阵(PWM),其中有的是实验的结果,还有的是计算预测出来的。

工具:

  • TFBSTools
  • HOMER
  • MEME FIMO

原理都是一样的,基于PWM矩阵,然后在序列里面扫描搜索。

一直有一个疑问,对于motif扫描的时候,我们是应该用差异的区域,还是全部的区域?。差异的区域中全部的还是仅上调的区域?

/usr/local/share/bio/homer/bin/findMotifsGenome.pl out.bed hg38 first -len 6,8,10,12,14

对于hommer,其result 有两个,一部分是能够和已有数据库中,匹配到的,另外一部分是基于序列预测出来了的,可能没有任何的生物学意义。

对于HINT 也发现能够做motif 富集。

rgt-motifanalysis --enrichment --organism mm9 --input-matrix Matrix_CDP_cDC.txt match/random_regions.bed

TF Footprints

除了motif,TF Footprints是另外一种研究转录因子调控的方法。原理是TF 与 DNA 结合会阻止结合位点内的 Tn5 切割,就会形成一个深渊低谷一样的峰值分布。

对于检测方法,目前都是基于Boyle 提出的变种隐马尔可夫模型HMM,即在每个碱基使用归一化和平滑的片段计数来检测不同的状态,例如足迹、侧翼和背景。其中目前用的比较多的是针对ATAC数据的HINT-ATAC。

最近的 HINT-ATAC 也使用 HMM,但只有 HINT-ATAC 校正了链特异性 Tn5 切割偏差.

Hint install introduction

# need bam and bed file for input 
## rep twice
rgt-hint footprinting --atac-seq --paired-end --output-prefix=fp_paired ATAC.bam ATACPeaks.bed
gt-motifanalysis matching --organism=hg38 --input-files IMN.bed IMP.bed  --output-location motif
rgt-hint differential --organism=hg38 --bc --nc 16 --mpbs-files=motif/IMN_mpbs.bed,motif/IMP_mpbs.bed --reads-files=IMN.bam,IMP.bam --conditions=IMN,IMP --output-location=tfprinting

@todo 针对单组数据,对于重复数据,两组,还代解决code。

RNA-seq 联合分析

通过 RNA-seq 定性或定量地将染色质可及性的变化与感兴趣的基因表达的变化联系起来,直观地,我们可以发现 DE 基因是否在相应的 TSS 周围也具有显着差异的染色质可及性,可以推断 DE 基因受与开放染色质中特定基序或足迹相关的 TF 调节.

WCGAN

https://mp.weixin.qq.com/s/mT7ttd5y0Z-rG93YoeGPXQ

PPI

案例:

1. PECA:转录因子TF,染色质调控因子CR和调控元件RE相互作用网络推断的新方法Cell Stem Cell 2019 )

image-20221222165742726

Fig: Schematic overview of the method for constructing TF-chromatin transcriptional regulatory network

可以使用 PECA[7] 方法重建调控网络。中科院王勇教授团队,利用匹配的基因表达和染色质可及性数据刻画转录因子和调控元件结合调控下游基因表达的数学模型,构建了描绘细胞状态转化的染色质调控网络,通过网络分析鉴定出TFAP2C和p63分别为表面外胚层起始和角质形成细胞成熟的关键因子.

PECA Github

2. 鸡胚的体节分化过程,挖掘关键的TF和Enhancer(nature communications 2021)

3. 揭示酒精诱导的抗焦虑过程中的表观基因组学和转录组学相互作用(Molecular P s ychiatry 2022)

Fig.n This model depicts the ability of acute ethanol to rapidly alter the epigenome in the amygdala and produce transcriptomic change

featureCounts

featureCounts -T 16 -p -t exon -g gene_id -a /home/wvdon/atac/gene/Homo_sapiens.GRCh38.106.gtf -o all_new_feature.txt \
/media/wvdon/MY-datas/Release_Datas_20210429/mRNA/bams/B87.sorted.bam

RNA-seq 数据分析(差异基因,火山图,热图,富集分析)

base <- read.table("/media/wvdon/sdata/atac-seq/before/all_new_feature.txt",row.names = 1 ,header=T,sep = '\t')
basedata=base[6:ncol(base)]
group_list <- factor(c(rep("IM",8), rep("NC",8)))
table(group_list)
colData <- data.frame(row.names=colnames(basedata),group_list=group_list)
head(basedata)
colnames(colData)
ncol(basedata)
nrow(colData)
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = basedata,colData = colData,
                              design = ~ group_list)
dds_2 <- DESeq(dds)
resultsNames(dds_2)
res <- results(dds_2)
diff_gene_deseq2 <-subset(res, padj < 0.05 )

up_DEG <- subset(res, padj < 0.05 & log2FoldChange > 1)
down_DEG <- subset(res, padj < 0.05 & log2FoldChange < -1)
dim(up_DEG)
dim(down_DEG)

#BiocManager::install("biomaRt")
library(biomaRt)
human <- useMart('ensembl',dataset = "hsapiens_gene_ensembl")
gene_id = row.names(diff_gene_deseq2)
gene_name<-getBM(attributes=c("ensembl_gene_id","external_gene_name"),filters = "ensembl_gene_id",values =gene_id , mart = human)

# RNA-seq,火山图, heatmap
library(ggplot2)
library(ggrepel)  #用于标记的包

library(pheatmap)
row.names(assay(dds_2))
head(assay(dds_2))
rownames(gene_name)=gene_name$ensembl_gene_id
final_rna =merge(gene_name,assay(dds_2), by = "row.names",all=F)
dim(assay(dds_2)) # =103 

head(final_rna)
heat_map_data=final_rna[4:ncol(final_rna)]
dim(heat_map_data)

Groups=c(rep("IM",8),rep("NC",8))
annotation_c<-data.frame(Groups)
annotation_c<-data.frame(Groups)
rownames(annotation_c)<-colnames(heat_map_data)
data<-log2(heat_map_data+1)
colnames(data)
#rownames(data)=final_rna$external_gene_name
ac = c("37AC73","30A72","33A37","46A94","47A84","49AC90","13A87","54A80",
"37C74","30C78","33C82","46C81","47C79","49C93","13C64","54C62")
p<-pheatmap(data, cluster_rows = T,      #行聚类,列不聚类
            cluster_cols = F,
            show_rownames = F,       #不显示行名
            
            show_colnames = T,      #显示列明 angle_row="15",行名旋转15度,列明相似
            
            annotation_col = annotation_c,  #对列进行注释即对列进行分组
            #na_col = "white",
            scale = "row",   #将数据按行进行标准化
            
            #设置格子大小 cellheigt=""设置格子高
            
            #设置格子高
            labels_col= ac,
            angle_col = 30,
            border=F
            ,color = colorRampPalette(colors= c("blue","white","red"))(10) 
            #,color = colorRampPalette(c("#FFFF00","#FF0000"))(100)
)  
p
ggsave(file="/media/wvdon/sdata/atac-seq/before/heatmap_rna.svg", plot=p, width=8, height=8)



# 火山图


final_rna_vo =merge(gene_name,DataFrame(diff_gene_deseq2), by = "row.names",all=F)
write.csv(final_rna_vo,'/media/wvdon/sdata/atac-seq/before/vo.csv')

data<-read.csv('/media/wvdon/sdata/atac-seq/before/vo.csv')
data$log2FoldChange=-data$log2FoldChange
cut_off_pvalue=0.05
data$external_gene_name
PvalueLimit = 5
data$label=ifelse(-log10(data$pvalue) > PvalueLimit , as.character(data$external_gene_name), '')
data$group<-as.factor(ifelse(data$pvalue <= 0.05 & abs(data$log2FoldChange) >=0,
                             ifelse(data$log2FoldChange<=0  ,'down','up'),'NS'))

this_tile <- paste0('Cutoff for logFC is abs 1.0 and pvalue is 0.05',
                    '\nThe number of up gene is 59',
                    '\nThe number of down gene is 43')                 

p <- ggplot(
  #设置数据
  data, 
  aes(x = log2FoldChange, 
      y = -log10(pvalue), 
      colour=group)) +
  geom_point(alpha=0.4, size=3.5) +
  ggtitle( this_tile ) +
  #scale_fill_manual(values=c("#d2dae2","#546de5"))+
  
  # 辅助线
  geom_vline(xintercept=c(-1,1),lty=4,col="black",lwd=0.8) +
  geom_hline(yintercept = -log10(cut_off_pvalue),lty=4,col="black",lwd=0.8) +
  
  # 坐标轴
  labs(x="log2(fold change)",
       y="-log10 (p-value)")+scale_color_manual(values = c( 'green','red'))+
  theme_bw()+ 
  
  #theme(plot.title = element_text(vjust = -30,size=12), 
  #      legend.position="right", 
  #      legend.title = element_blank(),
  #)+
  geom_text_repel(aes(x = log2FoldChange,                   # geom_text_repel 标记函数
                      y = -1*log10(pvalue),          
                      label=label),                       
                  max.overlaps = 10000,                    # 最大覆盖率,当点很多时,有些标记会被覆盖,调大该值则不被覆盖,反之。
                  size=5,                                  # 字体大小
                  box.padding=unit(0.5,'lines'),           # 标记的边距
                  point.padding=unit(0.1, 'lines'), 
                  segment.color='black',      
                  show.legend=FALSE)   

# 图例
#p+ggtitle(this_tile)
p
ggsave(file="/media/wvdon/sdata/atac-seq/before/vo_rna.svg", plot=p, width=8, height=8)
write.csv(data,'/media/wvdon/sdata/atac-seq/before/p005_filter_rna_before.csv')

human <- useMart('ensembl',dataset = "hsapiens_gene_ensembl")
gene_id = row.names(res)
gene_name<-getBM(attributes=c("ensembl_gene_id","external_gene_name"),filters = "ensembl_gene_id",values =gene_id , mart = human)

res$log2FoldChange=-res$log2FoldChange
rownames(gene_name)=gene_name$ensembl_gene_id
all_before_rna = merge(gene_name,DataFrame(res), by = "row.names",all=F)

write.csv(all_before_rna,'/media/wvdon/sdata/atac-seq/before/all_before_rna.csv')

总结

Fig. The Summary of ATAC & RNA-seq

参考

  1. Yan, Feng, et al. “From reads to insight: a hitchhiker’s guide to ATAC-seq data analysis.” Genome biology 21.1 (2020): 1-16.
  2. Krishnan, H. R. et al. Unraveling the epigenomic and transcriptomic interplay during alcohol-induced anxiolysis. Mol. Psychiatry (2022) doi:10.1038/s41380-022-01732-2.
  3. Mok, G. F. et al. Characterising open chromatin in chick embryos identifies cis-regulatory elements important for paraxial mesoderm formation and axis extension. Nat. Commun. 12, 1157 (2021).
  4. RS ∗, GB †. DiffBind: Differential binding analysis of ChIP- Seq peak data.
  5. Li, Z., Schulz, M. H., Look, T., Begemann, M., Zenke, M., & Costa, I. G. (2019). Identification of transcription factor binding sites using ATAC-seq. Genome Biology, 20(1), 45.
  6. Duren Z, Chen X, Xin J, et al. Time course regulatory analysis based on paired expression and chromatin accessibility data[J]. Genome research, 2020, 30(4): 622-634.
  7. Li, Lingjie, et al. “TFAP2C-and p63-dependent networks sequentially rearrange chromatin landscapes to drive human epidermal lineage commitment.” Cell Stem Cell 24.2 (2019): 271-284
  8. Quinlan, AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 2010;26:841-842
  9. “肠化”到底是怎么回事?什么情况下会癌变?https://view.inews.qq.com/a/20210205A0CTTC00

Software Version

macs2 2.2.4 
bwa 0.7.17 
bowtie2 2.3.4.3 
pipeline (v2.1.3) 
Homer 
HINT-ATAC 
  
  
  

所有的代码都被上传到GitHub,https://github.com/wvdon/BIO_ATAC

目前是个人私人仓库,后续会开放。