The R package splice2neo provides multiple functions for the analysis of splice junctions. The main focus is on the association of splice junctions with somatic mutations. The package integrates the output of several tools that predict splicing effects from mutations. It annotates these mutation effects with the resulting splicing event (alternative 3’/5’ splice sites, exons skipping, intron retentions) and the resulting splice junctions in a standardized format. Expressed splice junctions detected by external tools from RNA-seq data can be parsed and integrated. Splice junctions can be filtered against canonical splice junctions and annotated with the altered transcript sequences, CDS, and resulting peptide sequences. Splice2neo was mainly developed to support the detection of neoantigen candidates from tumor-specific splice junctions. Neoantigens are tumor-specific mutated peptides recognized by T-cells and utilized in immunotherapies against tumors (Lang et al. 2022). However, most functions from the splice2neo package are also useful for other types of splicing analysis.
In this vignette, we use use some toy example data to demonstrate how splice2neo can be used to analyze splice-junctions which are derived from mutation predictions as tumor-specific neoantigen candidates (Lang et al. 2023).
The R package splice2neo is not yet on CRAN or Bioconductor. Therefore, you need to install splice2neo from github.
if (!requireNamespace("remotes", quietly = TRUE)) {
install.packages("remotes")
}
remotes::install_github("TRON-Bioinformatics/splice2neo")
Integrating splice2neo functions and detection rules based on splice effect scores and RNA-seq support facilitates the identification of mutation-associated splice junctions which are tumor-specific and can encode neoantigen candidates. The following workflow explains how to use splice2neo to predict neoantigen candidates from mutation-associated splice junctions using example data.
A database of transcripts is required for transcript annotation with splice2neo. Here, we show how to build the database from a GTF file that is a subset of GENCODE annotations, but users can build these databases also from other resources.
# use gtf file of choice and transform into transcript database
gtf_file <- system.file("extdata/example/gencode.v34lift37.annotation.subset.gtf.gz", package="splice2neo")
# parse GTF file as txdb object
txdb <- GenomicFeatures::makeTxDbFromGFF(gtf_file)
#> Warning in call_fun_in_txdbmaker("makeTxDbFromGFF", ...): makeTxDbFromGFF() has moved to the txdbmaker package. Please call
#> txdbmaker::makeTxDbFromGFF() to get rid of this warning.
#> Import genomic features from the file as a GRanges object ... OK
#> Prepare the 'metadata' data frame ... OK
#> Make the TxDb object ...
#> Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type
#> stop_codon. This information was ignored.
#> OK
# optionally save database for later re-use
# saveDb(txdb, file = "/path/to/transripts/txdb.sqlite")
We can retrieve the canonical splice junctions between all adjacent
exons from a list of transcripts with
canonical_junctions()
.
ref_junc <- canonical_junctions(transcripts)
str(ref_junc)
#> chr [1:1316] "chr2:28680064-28739675:+" "chr2:28739736-28741333:+" ...
In splice2neo, splice junctions are represented by a
junc_id
, which is defined by the genomic coordinates of the
exon positions (1-based) that are combined through the junction and the
transcription strand of the transcript:
chr:pos1-pos2:strand
.
This format allows direct conversion to GRangs
objects
from the GenomicRanges
package.
gr <- GenomicRanges::GRanges(ref_junc)
gr
#> GRanges object with 1316 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr2 28680064-28739675 +
#> [2] chr2 28739736-28741333 +
#> [3] chr2 28741399-28742572 +
#> [4] chr2 28742630-28748134 +
#> [5] chr2 28748174-28748772 +
#> ... ... ... ...
#> [1312] chrX 76949322-76949323 -
#> [1313] chrX 76944394-76944395 -
#> [1314] chrX 76952175-76952176 -
#> [1315] chrX 77041690-77041691 -
#> [1316] chrX 76953090-76953091 -
#> -------
#> seqinfo: 13 sequences from an unspecified genome; no seqlengths
A given GRanges object with ranges of splice junctions can be converted into a vector of `junc_id``s:
junc_id <- as.character(gr)
head(junc_id)
#> [1] "chr2:28680064-28739675:+" "chr2:28739736-28741333:+"
#> [3] "chr2:28741399-28742572:+" "chr2:28742630-28748134:+"
#> [5] "chr2:28748174-28748772:+" "chr2:28748812-28752184:+"
Splice2neo transforms splice events identified in RNA-seq into a standardized junction format. We show here how to parse the tools LeafCutter and SplAdder.
The output of the tool LeafCutter (Li et al. 2018) can be transformed using
leafcutter_transform()
:
leafcutter_path <- system.file("extdata/example/leafcutter", package="splice2neo")
leafcutter <-
leafcutter_transform(path = leafcutter_path)
#> Importing Leafcutter files...
#> Importing Leafcutter files completed
head(leafcutter)
#> # A tibble: 6 × 7
#> junction_start junction_end strand chromosome Gene class junc_id
#> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr>
#> 1 39934433 39937097 - chrX NA NA chrX:39934433-39937…
#> 2 76944420 76953071 - chrX NA NA chrX:76944420-76953…
#> 3 18026082 18029488 - chr11 NA NA chr11:18026082-1802…
#> 4 20921104 20922809 + chr22 NA NA chr22:20921104-2092…
#> 5 153333024 153456093 - chr4 NA NA chr4:153333024-1534…
#> 6 32084907 32085100 - chr6 NA NA chr6:32084907-32085…
The output of the tool SplAdder (Kahles et al. 2016) can be transformed using
spladder_transform()
:
spladder_path <- system.file("extdata/example/spladder", package="splice2neo")
spladder <-
spladder_transform(path = spladder_path)
#> Importing Spladder files ...
#> Importing Spladder files completed
head(spladder)
#> # A tibble: 6 × 8
#> junction_start junction_end strand chromosome Gene class junc_id
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 20921104 20922808 + chr22 ENSG00000099917.1… A3SS chr22:…
#> 2 20921104 20922809 + chr22 ENSG00000099917.1… A3SS chr22:…
#> 3 41108521 41109155 - chr17 ENSG00000266967.7… A3SS chr17:…
#> 4 41108577 41109155 - chr17 ENSG00000266967.7… A3SS chr17:…
#> 5 148881736 148884821 + chr3 ENSG00000163755.9… A3SS chr3:1…
#> 6 148881736 148884852 + chr3 ENSG00000163755.9… A3SS chr3:1…
#> # ℹ 1 more variable: number_supporting_reads <dbl>
Now we want to get a single data.frame
of all splice
junctions identified by RNA-seq tools and information by which tool they
were identified.
rna_junctions <-
generate_combined_dataset(list("spladder_juncs" = spladder, "leafcutter_juncs" = leafcutter))
head(rna_junctions)
#> # A tibble: 6 × 10
#> junc_id junction_start junction_end strand chromosome identified_by_leafcu…¹
#> <chr> <chr> <chr> <chr> <chr> <lgl>
#> 1 chr1:266… 26627515 26628185 - chr1 TRUE
#> 2 chr1:849… 84964231 84971694 - chr1 TRUE
#> 3 chr10:82… 82267130 82269057 + chr10 FALSE
#> 4 chr10:82… 82267130 82271900 + chr10 FALSE
#> 5 chr10:82… 82269227 82271900 + chr10 FALSE
#> 6 chr11:18… 18026082 18029488 - chr11 TRUE
#> # ℹ abbreviated name: ¹identified_by_leafcutter_juncs
#> # ℹ 4 more variables: identified_by_spladder_juncs <lgl>, Gene <chr>,
#> # class <chr>, spladder_juncs_number_supporting_reads <dbl>
Deep learning tools can predict the effect of a mutation on the
disruption or creation of canonical splicing sequence motifs. This is
typically provided as gain or loss of splicing acceptor or donor sites.
Splice2neo can transform and annotate such mutation effects into a
unified splice junction format. Currently, the transformation of
predictions by SpliceAI (Jaganathan et al. 2019), MMSplice(Cheng et al. 2019) and Pangolin(Zeng and Li 2022) are supported.
Here, we show how to annotate results from SpliceAI and MMSplice. The
annotation of Pangolin results works analogously with SpliceAI.
Effect predictions by SpliceAI can be read with
parse_spliceai()
spliceai_file <- system.file("extdata/example/spliceAI.vcf", package="splice2neo")
# parse SpliceAI results
spliceai <-
parse_spliceai(spliceai_file)
head(spliceai)
#> # A tibble: 6 × 18
#> CHROM POS ID REF ALT QUAL FILTER Key ALLELE SYMBOL DS_AG DS_AL
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr> <dbl> <dbl>
#> 1 chr10 8006689 NA C T NA PASS 1 T TAF3 0 0
#> 2 chr10 91528535 NA G A NA PASS 2 A KIF20B 0.64 0.22
#> 3 chr11 18028133 NA C T NA PASS 3 T SERGEF 0 0
#> 4 chr11 62985208 NA C T NA PASS 4 T SLC22A… 0.06 0.55
#> 5 chr16 27709648 NA G A NA PASS 5 A KIAA05… 0.97 1
#> 6 chr16 30020302 NA G A NA PASS 6 A DOC2A 0 0
#> # ℹ 6 more variables: DS_DG <dbl>, DS_DL <dbl>, DP_AG <int>, DP_AL <int>,
#> # DP_DG <int>, DP_DL <int>
Next, SpliceAi effects can be transformed into a standardized effect
dataframe with format_spliceai()
. Here, we can optionally
provide a data.frame
(gene_table
) which
relates gene symbols with GENCODE gene ids. If gene_table
is provided, the formatted data.frame
contains the GENCODE
gene ids. While this is advised to better select relevant transcripts,
it is not required.
gene_table <-
readr::read_tsv(system.file("extdata/example/gene_table.tsv", package =
"splice2neo"))
#> Rows: 17 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): strand, gene_id, gene_name
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(gene_table)
#> # A tibble: 6 × 3
#> strand gene_id gene_name
#> <chr> <chr> <chr>
#> 1 - ENSG00000134313.16_8 KIDINS220
#> 2 + ENSG00000163803.13_8 PLB1
#> 3 - ENSG00000163510.14_5 CWC22
#> 4 - ENSG00000136237.19_8 RAPGEF5
#> 5 + ENSG00000147316.13_4 MCPH1
#> 6 + ENSG00000138182.14_3 KIF20B
# format SpliceAI results
splicai_formatted <-
format_spliceai(spliceai, gene_table = gene_table)
head(splicai_formatted)
#> # A tibble: 6 × 7
#> mut_id effect score chr pos_rel pos gene_id
#> <chr> <fct> <dbl> <chr> <int> <int> <chr>
#> 1 chr10_91528535_G_A AG 0.64 chr10 2 91528537 ENSG00000138182.14_3
#> 2 chr10_91528535_G_A AL 0.22 chr10 -40 91528495 ENSG00000138182.14_3
#> 3 chr11_18028133_C_T DG 0.09 chr11 42 18028175 ENSG00000129158.11_6
#> 4 chr11_18028133_C_T DL 0.15 chr11 5 18028138 ENSG00000129158.11_6
#> 5 chr11_62985208_C_T AG 0.06 chr11 -2 62985206 ENSG00000196600.12_6
#> 6 chr11_62985208_C_T AL 0.55 chr11 -1 62985207 ENSG00000196600.12_6
Predicted SpliceAI effects can be annotated with all potential
resulting junctions and transcripts with
annotate_mut_effect()
. If the input was formatted using
gene_table
, the user may want to choose
gene_mapping = TRUE
which removes unlikely events from the
results. Otherwise, the default gene_mapping = FALSE
should
be used. The resulting dataframe contains all potential alternative
transcripts per row, meaning that there can be multiple annotated
transcripts per junction and also multiple junctions per mutation
effect. At this point, no filtering based on effect size has been
applied meaning that the dataframe contains unlikely events.
# annotate SpliceAI junctions with all potential junctions and transcripts
spliceai_annotated <-
annotate_mut_effect(
effect_df = splicai_formatted,
transcripts = transcripts,
transcripts_gr = transcripts_gr,
gene_mapping = TRUE
)
#> INFO: calculate coordinates of upstream and downstream exons...
#> INFO: get overlapping transcripts...
#> INFO: Build df ...
#> INFO: build pairs...
#> INFO: find overlapping exons...
#> INFO: calculate junction coordinates from predicted effect...
#> INFO: Evaluation of rules done.
spliceai_annotated %>%
dplyr::select(mut_id, junc_id, gene_id, tx_id, effect, score) %>%
head()
#> # A tibble: 6 × 6
#> mut_id junc_id gene_id tx_id effect score
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 chr10_91528535_G_A chr10:91528148-91528537:+ ENSG000001381… ENST… AG 0.64
#> 2 chr10_91528535_G_A chr10:91528148-91528537:+ ENSG000001381… ENST… AG 0.64
#> 3 chr10_91528535_G_A chr10:91528148-91528537:+ ENSG000001381… ENST… AG 0.64
#> 4 chr10_91528535_G_A chr10:91528494-91528495:+ ENSG000001381… ENST… AL 0.22
#> 5 chr10_91528535_G_A chr10:91528148-91532446:+ ENSG000001381… ENST… AL 0.22
#> 6 chr10_91528535_G_A chr10:91528494-91528495:+ ENSG000001381… ENST… AL 0.22
It can happen that multiple effects lead to the same altered
transcripts (defined by mut_id
, junc_id
,
tx_id
). The function unique_mut_junc()
can be
used to filter for unique altered transcripts by keeping the effect with
the highest effect score.
# unique altered transcripts
spliceai_annotated_unique <- unique_mut_junc(spliceai_annotated)
nrow(spliceai_annotated_unique) < nrow(spliceai_annotated)
#> [1] TRUE
We can read the effect predictions by MMSplice with
parse_mmsplice()
mmsplice_file <- system.file("extdata/example/MMsplice.csv", package="splice2neo")
# parse MMSplice results
mmsplice <-
parse_mmsplice(mmsplice_file)
head(mmsplice)
#> # A tibble: 6 × 20
#> ID exons exon_id gene_id gene_name transcript_id delta_logit_psi
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 chr2:28764482:C… chr2… ENSE00… ENSG00… PLB1 ENST00000327… 0.00444
#> 2 chr2:28764482:C… chr2… ENSE00… ENSG00… PLB1 ENST00000422… 0.00444
#> 3 chr2:28764482:C… chr2… ENSE00… ENSG00… PLB1 ENST00000404… 0.00444
#> 4 chr2:97312723:C… chr2… ENSE00… ENSG00… FER1L5 ENST00000624… 0.0464
#> 5 chr2:97312723:C… chr2… ENSE00… ENSG00… FER1L5 ENST00000436… 0.0464
#> 6 chr2:97312723:C… chr2… ENSE00… ENSG00… FER1L5 ENST00000641… 0.0464
#> # ℹ 13 more variables: ref_acceptorIntron <dbl>, ref_acceptor <dbl>,
#> # ref_exon <dbl>, ref_donor <dbl>, ref_donorIntron <dbl>,
#> # alt_acceptorIntron <dbl>, alt_acceptor <dbl>, alt_exon <dbl>,
#> # alt_donor <dbl>, alt_donorIntron <dbl>, pathogenicity <dbl>,
#> # efficiency <dbl>, mut_id <chr>
The MMSplice data can be annotated with all potential splice
junctions using annotate_mmsplice()
mmsplice_annotated <- mmsplice %>%
annotate_mmsplice(transcripts = transcripts)
head(mmsplice_annotated)
#> # A tibble: 6 × 22
#> ID exons exon_id gene_id gene_name tx_id delta_logit_psi ref_acceptorIntron
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
#> 1 chr2… chr2… ENSE00… ENSG00… PLB1 ENST… 0.00444 -4.31
#> 2 chr2… chr2… ENSE00… ENSG00… PLB1 ENST… 0.00444 -4.31
#> 3 chr2… chr2… ENSE00… ENSG00… PLB1 ENST… 0.00444 -4.31
#> 4 chr2… chr2… ENSE00… ENSG00… PLB1 ENST… 0.00444 -4.31
#> 5 chr2… chr2… ENSE00… ENSG00… PLB1 ENST… 0.00444 -4.31
#> 6 chr2… chr2… ENSE00… ENSG00… PLB1 ENST… 0.00444 -4.31
#> # ℹ 14 more variables: ref_acceptor <dbl>, ref_exon <dbl>, ref_donor <dbl>,
#> # ref_donorIntron <dbl>, alt_acceptorIntron <dbl>, alt_acceptor <dbl>,
#> # alt_exon <dbl>, alt_donor <dbl>, alt_donorIntron <dbl>,
#> # pathogenicity <dbl>, efficiency <dbl>, mut_id <chr>, event_type <chr>,
#> # junc_id <chr>
Junctions predicted from MMSplice results can contain duplicated
altered transcripts for exon inclusion events. We use the function
unique_junc_mmsplice()
to filter for unique altered
transcripts by keeping the effect with the highest effect score.
# unique altered transcripts
mmsplice_annotated_unique <- unique_junc_mmsplice(mmsplice_annotated)
nrow(mmsplice_annotated_unique) < nrow(mmsplice_annotated)
#> [1] FALSE
Next, we combine splice junctions from SpliceAI and MMSplice into one
dataframe using combine_mut_junc()
. This dataframe will
contain unique altered transcripts (defined by mut_id
,
junc_id
, tx_id
) in rows. Tool specific
information are defined in columns starting with the respective tool
name.
mut_junctions <- combine_mut_junc(list(
"spliceai" = spliceai_annotated_unique,
"mmsplice" = mmsplice_annotated_unique
))
mut_junctions %>%
dplyr::select(mut_id, junc_id, tx_id, spliceai_event_type, spliceai_score) %>%
head()
#> # A tibble: 6 × 5
#> mut_id junc_id tx_id spliceai_event_type spliceai_score
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 chr10_8006689_C_T chr10:7866523-801… ENST… NA NA
#> 2 chr10_91528535_G_A chr10:91528148-91… ENST… alternative 3prim 0.64
#> 3 chr10_91528535_G_A chr10:91528148-91… ENST… exon skipping 0.22
#> 4 chr10_91528535_G_A chr10:91528494-91… ENST… intron retention 0.22
#> 5 chr10_91528535_G_A chr10:91528148-91… ENST… alternative 3prim 0.64
#> 6 chr10_91528535_G_A chr10:91528148-91… ENST… exon skipping 0.22
To exclude junctions expressed in healthy tissues, we here want to
filter predicted junctions for non-canonical splice junctions. We can
use for instance exon-exon junctions from GENCODE and junctions found in
healthy tisssue samples. Splice2neo also provides the function
bed_to_junc()
which allows to transform a bed file into a
list of splice junctions.
# this is just an example file with a few canonical junctions --> incomplete
canonical_juncs_file <- system.file("extdata/example/test_canonical.bed", package="splice2neo")
canonical_junctios <- bed_to_junc(bed_file = canonical_juncs_file, type = "exon-exon")
Annotate mutation-retrieved splice junction whether they are canonical splice junctions.
# annotated canonical junctions
mut_junctions <- mut_junctions %>%
mutate(is_canonical = is_canonical(junc_id, ref_junc = canonical_junctios, exons_gr = transcripts))
mut_junctions %>%
count(is_canonical)
#> # A tibble: 2 × 2
#> is_canonical n
#> <lgl> <int>
#> 1 FALSE 202
#> 2 TRUE 124
# remove canonical junctions
noncan_mut_junctions <- mut_junctions %>%
filter(!is_canonical)
Annotate whether mutation-retrieved splice junction were also detected by a RNA-seq tool.
noncan_mut_junctions <- noncan_mut_junctions %>%
mutate(is_in_rnaseq = is_in_rnaseq(junc_id, rna_juncs = rna_junctions$junc_id))
Splice2neo can annotate the resulting transcript sequence for each
altered transcript with add_context_seq()
. Here, we set the
size
parameter to 400 to extract the sequence of +/-200
nucleotides around the splice junctions as context sequence.
noncan_mut_junctions <- noncan_mut_junctions %>%
add_context_seq(
size = 400,
bsg = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
transcripts = transcripts
)
noncan_mut_junctions %>%
dplyr::select(mut_id, junc_id, tx_id, cts_id, cts_seq) %>%
head()
#> # A tibble: 6 × 5
#> mut_id junc_id tx_id cts_id cts_seq
#> <chr> <chr> <chr> <chr> <chr>
#> 1 chr10_8006689_C_T chr10:7866523-8019204:+ ENST00000344293.6… dd63d… CCAGCT…
#> 2 chr10_91528535_G_A chr10:91528148-91528537:+ ENST00000260753.8… 83411… AGCCAA…
#> 3 chr10_91528535_G_A chr10:91528148-91532446:+ ENST00000260753.8… f7ae9… AGCCAA…
#> 4 chr10_91528535_G_A chr10:91528494-91528495:+ ENST00000260753.8… 30ff5… AGCCAA…
#> 5 chr10_91528535_G_A chr10:91528148-91528537:+ ENST00000371728.7… 83411… AGCCAA…
#> 6 chr10_91528535_G_A chr10:91528148-91532446:+ ENST00000371728.7… f7ae9… AGCCAA…
Splice2neo can annotate the resulting peptide sequence for each
altered transcript with add_peptide()
. This is relevant to
analyze which altered transcript could qualify as neoantigens. We can
define by the parameter flanking_size
the number of
wild-type amino acids flanking the junction or novel sequence caused by
the splice junction to the left and to the right. If the junction leads
to a reading frame shift, sequences are flanked by wild-type amino acids
to the left and are annotated until the first stop codon.
noncan_mut_junctions <- noncan_mut_junctions %>%
add_peptide(flanking_size = 13,
bsg = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
cds = cds)
noncan_mut_junctions %>%
dplyr::select(mut_id,
junc_id,
tx_id,
junc_in_orf,
peptide_context,
frame_shift) %>%
head()
#> # A tibble: 6 × 6
#> mut_id junc_id tx_id junc_in_orf peptide_context frame_shift
#> <chr> <chr> <chr> <lgl> <chr> <lgl>
#> 1 chr10_8006689_C_T chr10:786652… ENST… TRUE IPDYLPPIVSSQED… TRUE
#> 2 chr10_91528535_G_A chr10:915281… ENST… TRUE SDDRNSSVKKEQKS… FALSE
#> 3 chr10_91528535_G_A chr10:915281… ENST… TRUE SDDRNSSVKKEQKQ… TRUE
#> 4 chr10_91528535_G_A chr10:915284… ENST… TRUE SDDRNSSVKKEQKV… TRUE
#> 5 chr10_91528535_G_A chr10:915281… ENST… TRUE SDDRNSSVKKEQKS… FALSE
#> 6 chr10_91528535_G_A chr10:915281… ENST… TRUE SDDRNSSVKKEQKQ… TRUE
Relevant intron retention events are particular challenging to
predict. Splice2neo provides the function exon_in_intron()
to annotate whether the exon of another transcript is in the genomic
region of a predicted retained introns. We do not want to consider them
in the downstream analysis as it is difficult to differentiate if the
read support from RNA-seq relates to the retained intron or to the exon
of the other transcript.
# annotate exons in introns of other transcript
noncan_mut_junctions <-
noncan_mut_junctions %>%
exon_in_intron(transcripts = transcripts)
Altered transcript from predicted splice junctions can be
re-quantified in RNA-seq in a targeted manner with EasyQuant (available
at https://github.com/TRON-Bioinformatics/easyquant).
EasyQuant maps all RNA-seq reads to to a custom reference build from the
context sequence around splice junctions. We can generate the input for
EasyQuant using the splice2neo function
transform_for_requant()
.
EasyQuant should be run in interval mode for the re-quantification of
splice junction and it is advised to set the BP_distannce
parameter to 10 (see EasyQuant
README).
# transform for EasyQuant
easyquant_input <- noncan_mut_junctions %>%
transform_for_requant()
head(easyquant_input)
#> # A tibble: 6 × 3
#> name sequence position
#> <chr> <chr> <chr>
#> 1 dd63de34c46767f447b72d5ab223df47 CCAGCTGATGGGGGTTAGTCTACATGAACTAGAAG… 200
#> 2 834110bae711a18586e9745ce37e6d33 AGCCAAACAAAATGGCAGTGAAACACCCTGGTTGT… 200
#> 3 f7ae934315938fe4c3b41bc44b74304c AGCCAAACAAAATGGCAGTGAAACACCCTGGTTGT… 200
#> 4 30ff58139be454c8ca52c4a28bf8a000 AGCCAAACAAAATGGCAGTGAAACACCCTGGTTGT… 0,200,5…
#> 5 7d57cc792100728dd7c9d6cb03f0bae2 CCCAGCAACTGAATGACTTCTGTAAACCCAGGAGT… 200
#> 6 29bc2a21dd2b55a13ddc993fc8be5086 TCGCATGGAGCGCGAGCCCAGCGCCTCGGAGGCCG… 200
After running EasyQuant (which is not shown here), we can import the
EasyQuant results with the splice2neo function
map_requant()
.
# path to easyquant folder
easyquant_path <- system.file("extdata/example/easyquant", package="splice2neo")
# merge easyquant results
requant_noncan_mut_junctions <-
map_requant(path_to_easyquant_folder = easyquant_path,
junc_tib = noncan_mut_junctions)
#> Rows: 261 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): name, interval
#> dbl (6): overlap_interval_end_reads, span_interval_end_pairs, within_interva...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Here, we are interested in different result columns depending on the
type of splice event. In case of alternative splice sites or exon
skipping events, want use the number of junction reads that map on the
splice junction (junc_interval_start
) and/or the number of
spanning reads that bridge the splice junction
(span_interval_start
).
# re-quantification results
requant_noncan_mut_junctions %>%
filter(spliceai_event_type != "intron retention") %>%
dplyr::select(mut_id, junc_id, tx_id, junc_interval_start, span_interval_start) %>%
head()
#> # A tibble: 6 × 5
#> mut_id junc_id tx_id junc_interval_start span_interval_start
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 chr10_91528535_G_A chr10:915281… ENST… 3 10
#> 2 chr10_91528535_G_A chr10:915281… ENST… 0 9
#> 3 chr10_91528535_G_A chr10:915281… ENST… 3 10
#> 4 chr10_91528535_G_A chr10:915281… ENST… 0 9
#> 5 chr10_91528535_G_A chr10:915281… ENST… 3 10
#> 6 chr10_91528535_G_A chr10:915281… ENST… 0 9
In case of intron retentions, we may want to consider the following columns:
within_interval
: Number of reads that map to the intron
of interestcoverage_perc
: Relative read coverage of the intron of
interestcoverage_median
: Median read coverage of the intron of
interestcoverage_mean
: Mean read coverage of the intron of
interest. This value can be misleading by skewed read distribution and
the user may rather want to use the median coverage
# re-quantification results
requant_noncan_mut_junctions %>%
filter(spliceai_event_type == "intron retention") %>%
dplyr::select(
mut_id,
junc_id,
tx_id,
within_interval,
coverage_perc,
coverage_median,
coverage_mean
) %>%
head()
#> # A tibble: 6 × 7
#> mut_id junc_id tx_id within_interval coverage_perc coverage_median
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 chr10_91528535_G_A chr10:… ENST… 1 0.543 1
#> 2 chr10_91528535_G_A chr10:… ENST… 1 0.543 1
#> 3 chr10_91528535_G_A chr10:… ENST… 1 0.543 1
#> 4 chr11_18028133_C_T chr11:… ENST… 85 0.937 4
#> 5 chr11_18028133_C_T chr11:… ENST… 85 0.937 4
#> 6 chr11_18028133_C_T chr11:… ENST… 85 0.937 4
#> # ℹ 1 more variable: coverage_mean <dbl>
# alternative to only import EasyQuant results without merging
requantification_results <-
read_requant(path_folder = easyquant_path)
#> Rows: 261 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): name, interval
#> dbl (6): overlap_interval_end_reads, span_interval_end_pairs, within_interva...
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(requantification_results)
#> # A tibble: 6 × 20
#> name junc_interval_start junc_interval_end span_interval_start
#> <chr> <dbl> <dbl> <dbl>
#> 1 5846baef085e1cb83c8… 0 1 8
#> 2 e0ab8366d4f7014c521… 0 1 9
#> 3 fcc744d7184c3bdae26… 1 0 7
#> 4 13490cdd721e01c1334… 0 1 3
#> 5 3f5a72af0853d8b59e6… 2 0 6
#> 6 48ba45ad78650a6d3b6… 2 0 3
#> # ℹ 16 more variables: span_interval_end <dbl>, within_interval <dbl>,
#> # within_interval_left <dbl>, within_interval_right <dbl>,
#> # coverage_perc <dbl>, coverage_perc_left <dbl>, coverage_perc_right <dbl>,
#> # coverage_mean <dbl>, coverage_mean_left <dbl>, coverage_mean_right <dbl>,
#> # coverage_median <dbl>, coverage_median_left <dbl>,
#> # coverage_median_right <dbl>, interval <chr>, interval_left <chr>,
#> # interval_right <chr>
In the analysis described above we identified mutation-retrieved splice junctions supported by RNA-seq. We have developed a stringent detection rule to predict tumor-specific targets from mutation-associated splice targets based on the mutation effect and the RNA-seq support (See Lang et al. (2023)). Currently, we suggest to focus on events from alternative splice sites and exon skipping and define such splice junctions as targets that have a |splice effect score| 0.35 and are found by LeafCutter/SplAdder or have at least 3 junction reads.
# exclude intron retention
potential_targets <- requant_noncan_mut_junctions %>%
filter(spliceai_event_type != "intron retention")
potential_targets %>%
select(junc_id, tx_id, spliceai_score, mmsplice_delta_logit_psi, is_in_rnaseq, junc_interval_start)
#> # A tibble: 116 × 6
#> junc_id tx_id spliceai_score mmsplice_delta_logit…¹ is_in_rnaseq
#> <chr> <chr> <dbl> <dbl> <lgl>
#> 1 chr10:91528148-9152… ENST… 0.64 NA FALSE
#> 2 chr10:91528148-9153… ENST… 0.22 -0.267 FALSE
#> 3 chr10:91528148-9152… ENST… 0.64 NA FALSE
#> 4 chr10:91528148-9153… ENST… 0.22 -0.267 FALSE
#> 5 chr10:91528148-9152… ENST… 0.64 NA FALSE
#> 6 chr10:91528148-9153… ENST… 0.22 -0.267 FALSE
#> 7 chr11:18026082-1802… ENST… 0.09 NA FALSE
#> 8 chr11:18026082-1802… ENST… 0.15 -2.50 TRUE
#> 9 chr11:18026082-1802… ENST… 0.09 NA FALSE
#> 10 chr11:18026082-1802… ENST… 0.15 -2.50 TRUE
#> # ℹ 106 more rows
#> # ℹ abbreviated name: ¹mmsplice_delta_logit_psi
#> # ℹ 1 more variable: junc_interval_start <dbl>
# apply detection rule
targets <- potential_targets %>%
filter(spliceai_score >= 0.35 | mmsplice_delta_logit_psi < -0.35) %>%
filter(is_in_rnaseq | junc_interval_start >= 3)
The data in targets
can contain several altered
transcripts per junctions which may lead to several neoantigen candidate
sequences per target splice junction.
# apply detection rule
targets %>%
distinct(junc_id, peptide_context) %>%
count(junc_id, name = "# neoantigen candidates") %>%
arrange(-`# neoantigen candidates`)
#> # A tibble: 7 × 2
#> junc_id `# neoantigen candidates`
#> <chr> <int>
#> 1 chr22:20921104-20922809:+ 3
#> 2 chr10:91528148-91528537:+ 2
#> 3 chr11:18026082-18029488:- 2
#> 4 chr16:27692851-27709650:+ 2
#> 5 chr2:8910961-8916878:- 2
#> 6 chrX:39934433-39937097:- 2
#> 7 chrX:76944420-76953071:- 1
Here, we have a list of splice junctions and want to know all potential affected transcripts.
Given a data.frame
of splice junctions (with a column
junc_id
), the function add_tx()
annotates all
possible transcripts which overlap with the splice junctions.
junc_df <- dplyr::tibble(
junc_id = toy_junc_id[c(1, 6, 10)]
)
junc_df <- junc_df %>%
add_tx(toy_transcripts)
junc_df
#> # A tibble: 21 × 3
#> junc_id tx_id tx_lst
#> <chr> <chr> <named list>
#> 1 chr2:152389996-152392205:- ENST00000409198 <GRanges>
#> 2 chr2:152389996-152392205:- ENST00000172853 <GRanges>
#> 3 chr2:152389996-152392205:- ENST00000397345 <GRanges>
#> 4 chr2:152389996-152392205:- ENST00000427231 <GRanges>
#> 5 chr2:152389996-152392205:- ENST00000618972 <GRanges>
#> 6 chr2:152389996-152392205:- ENST00000413693 <GRanges>
#> 7 chr2:152389996-152392205:- ENST00000603639 <GRanges>
#> 8 chr2:152389996-152392205:- ENST00000604864 <GRanges>
#> 9 chr2:152389996-152392205:- ENST00000420924 <GRanges>
#> 10 chr2:179415981-179416357:- ENST00000342992 <GRanges>
#> # ℹ 11 more rows
This can lead to large data sets containing many highly unlikely
junction transcript combinations. We can select a subset of transcripts
per junction that are more likely to be affected by a junction with
choose_tx()
. Please note that this function may loose
relevant or keep irrelevant junction-transcripts in particular in
regions with multiple isoforms with distinct splicing pattern.
selected_junc_df <- junc_df %>%
choose_tx()
selected_junc_df
#> # A tibble: 21 × 4
#> junc_id tx_id tx_lst putative_event_type
#> <chr> <chr> <named list> <chr>
#> 1 chr2:152389996-152392205:- ENST00000409198 <GRanges> ASS
#> 2 chr2:152389996-152392205:- ENST00000172853 <GRanges> ASS
#> 3 chr2:152389996-152392205:- ENST00000397345 <GRanges> ASS
#> 4 chr2:152389996-152392205:- ENST00000427231 <GRanges> ASS
#> 5 chr2:152389996-152392205:- ENST00000618972 <GRanges> ASS
#> 6 chr2:152389996-152392205:- ENST00000413693 <GRanges> ASS
#> 7 chr2:152389996-152392205:- ENST00000603639 <GRanges> ASS
#> 8 chr2:152389996-152392205:- ENST00000604864 <GRanges> ASS
#> 9 chr2:152389996-152392205:- ENST00000420924 <GRanges> ASS
#> 10 chr2:179415981-179416357:- ENST00000342992 <GRanges> ASS
#> # ℹ 11 more rows
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.4 splice2neo_0.6.12
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.3 bitops_1.0-9
#> [3] httr2_1.0.6 permute_0.9-7
#> [5] biomaRt_2.62.0 rlang_1.1.4
#> [7] magrittr_2.0.3 vcfR_1.15.0
#> [9] matrixStats_1.4.1 compiler_4.4.2
#> [11] RSQLite_2.3.7 mgcv_1.9-1
#> [13] GenomicFeatures_1.58.0 png_0.1-8
#> [15] systemfonts_1.1.0 vctrs_0.6.5
#> [17] txdbmaker_1.2.0 stringr_1.5.1
#> [19] memuse_4.2-3 pkgconfig_2.0.3
#> [21] crayon_1.5.3 fastmap_1.2.0
#> [23] dbplyr_2.5.0 XVector_0.46.0
#> [25] utf8_1.2.4 Rsamtools_2.22.0
#> [27] rmarkdown_2.29 tzdb_0.4.0
#> [29] UCSC.utils_1.2.0 ragg_1.3.3
#> [31] purrr_1.0.2 bit_4.5.0
#> [33] xfun_0.49 zlibbioc_1.52.0
#> [35] cachem_1.1.0 GenomeInfoDb_1.42.0
#> [37] jsonlite_1.8.9 progress_1.2.3
#> [39] blob_1.2.4 DelayedArray_0.32.0
#> [41] BiocParallel_1.40.0 cluster_2.1.6
#> [43] parallel_4.4.2 prettyunits_1.2.0
#> [45] R6_2.5.1 bslib_0.8.0
#> [47] stringi_1.8.4 rtracklayer_1.66.0
#> [49] GenomicRanges_1.58.0 jquerylib_0.1.4
#> [51] Rcpp_1.0.13-1 SummarizedExperiment_1.36.0
#> [53] knitr_1.48 readr_2.1.5
#> [55] IRanges_2.40.0 splines_4.4.2
#> [57] Matrix_1.7-1 tidyselect_1.2.1
#> [59] abind_1.4-8 yaml_2.3.10
#> [61] vegan_2.6-8 codetools_0.2-20
#> [63] curl_6.0.0 lattice_0.22-6
#> [65] tibble_3.2.1 Biobase_2.66.0
#> [67] withr_3.0.2 KEGGREST_1.46.0
#> [69] evaluate_1.0.1 pinfsc50_1.3.0
#> [71] desc_1.4.3 BiocFileCache_2.14.0
#> [73] xml2_1.3.6 Biostrings_2.74.0
#> [75] pillar_1.9.0 filelock_1.0.3
#> [77] MatrixGenerics_1.18.0 stats4_4.4.2
#> [79] generics_0.1.3 vroom_1.6.5
#> [81] RCurl_1.98-1.16 S4Vectors_0.44.0
#> [83] hms_1.1.3 glue_1.8.0
#> [85] tools_4.4.2 BiocIO_1.16.0
#> [87] BSgenome_1.74.0 GenomicAlignments_1.42.0
#> [89] fs_1.6.5 XML_3.99-0.17
#> [91] grid_4.4.2 tidyr_1.3.1
#> [93] ape_5.8 AnnotationDbi_1.68.0
#> [95] nlme_3.1-166 GenomeInfoDbData_1.2.13
#> [97] restfulr_0.0.15 cli_3.6.3
#> [99] rappdirs_0.3.3 textshaping_0.4.0
#> [101] fansi_1.0.6 viridisLite_0.4.2
#> [103] S4Arrays_1.6.0 BSgenome.Hsapiens.UCSC.hg19_1.4.3
#> [105] sass_0.4.9 digest_0.6.37
#> [107] BiocGenerics_0.52.0 SparseArray_1.6.0
#> [109] rjson_0.2.23 htmlwidgets_1.6.4
#> [111] memoise_2.0.1 htmltools_0.5.8.1
#> [113] pkgdown_2.1.1 lifecycle_1.0.4
#> [115] httr_1.4.7 MASS_7.3-61
#> [117] bit64_4.5.2