Introduction

The R package splice2neo provides multiple functions for the analysis of splice junctions. The main focus is on the association of splice junctions with somatic mutations. The package integrates the output of several tools that predict splicing effects from mutations. It annotates these mutation effects with the resulting splicing event (alternative 3’/5’ splice sites, exons skipping, intron retentions) and the resulting splice junctions in a standardized format. Expressed splice junctions detected by external tools from RNA-seq data can be parsed and integrated. Splice junctions can be filtered against canonical splice junctions and annotated with the altered transcript sequences, CDS, and resulting peptide sequences. Splice2neo was mainly developed to support the detection of neoantigen candidates from tumor-specific splice junctions. Neoantigens are tumor-specific mutated peptides recognized by T-cells and utilized in immunotherapies against tumors (Lang et al. 2022). However, most functions from the splice2neo package are also useful for other types of splicing analysis.

In this vignette, we use use some toy example data to demonstrate how splice2neo can be used to analyze splice-junctions which are derived from mutation predictions as tumor-specific neoantigen candidates (Lang et al. 2023).

Installation

The R package splice2neo is not yet on CRAN or Bioconductor. Therefore, you need to install splice2neo from github.

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

remotes::install_github("TRON-Bioinformatics/splice2neo")

Example workflow to predict mutation-associated alternative splicing

Integrating splice2neo functions and detection rules based on splice effect scores and RNA-seq support facilitates the identification of mutation-associated splice junctions which are tumor-specific and can encode neoantigen candidates. The following workflow explains how to use splice2neo to predict neoantigen candidates from mutation-associated splice junctions using example data.

Building transcript database

A database of transcripts is required for transcript annotation with splice2neo. Here, we show how to build the database from a GTF file that is a subset of GENCODE annotations, but users can build these databases also from other resources.

# use gtf file of choice and transform into transcript database
gtf_file <- system.file("extdata/example/gencode.v34lift37.annotation.subset.gtf.gz", package="splice2neo")

# parse GTF file as txdb object
txdb <- GenomicFeatures::makeTxDbFromGFF(gtf_file)
#> Warning in call_fun_in_txdbmaker("makeTxDbFromGFF", ...): makeTxDbFromGFF() has moved to the txdbmaker package. Please call
#>   txdbmaker::makeTxDbFromGFF() to get rid of this warning.
#> Import genomic features from the file as a GRanges object ... OK
#> Prepare the 'metadata' data frame ... OK
#> Make the TxDb object ...
#> Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type
#>   stop_codon. This information was ignored.
#> OK
# optionally save database for later re-use 
# saveDb(txdb, file = "/path/to/transripts/txdb.sqlite")
# Build transcript database
transcripts <-
  GenomicFeatures::exonsBy(txdb, by = c("tx"), use.names = TRUE)
transcripts_gr <- GenomicFeatures::transcripts(txdb, columns = c("gene_id", "tx_id", "tx_name"))

# Build cds database
cds <- GenomicFeatures::cdsBy(txdb, by = c("tx"), use.name = TRUE)

Canonical junctions

We can retrieve the canonical splice junctions between all adjacent exons from a list of transcripts with canonical_junctions().

ref_junc <- canonical_junctions(transcripts)

str(ref_junc)
#>  chr [1:1316] "chr2:28680064-28739675:+" "chr2:28739736-28741333:+" ...

Splice junction id

In splice2neo, splice junctions are represented by a junc_id, which is defined by the genomic coordinates of the exon positions (1-based) that are combined through the junction and the transcription strand of the transcript: chr:pos1-pos2:strand.

This format allows direct conversion to GRangs objects from the GenomicRanges package.

gr <- GenomicRanges::GRanges(ref_junc)

gr
#> GRanges object with 1316 ranges and 0 metadata columns:
#>          seqnames            ranges strand
#>             <Rle>         <IRanges>  <Rle>
#>      [1]     chr2 28680064-28739675      +
#>      [2]     chr2 28739736-28741333      +
#>      [3]     chr2 28741399-28742572      +
#>      [4]     chr2 28742630-28748134      +
#>      [5]     chr2 28748174-28748772      +
#>      ...      ...               ...    ...
#>   [1312]     chrX 76949322-76949323      -
#>   [1313]     chrX 76944394-76944395      -
#>   [1314]     chrX 76952175-76952176      -
#>   [1315]     chrX 77041690-77041691      -
#>   [1316]     chrX 76953090-76953091      -
#>   -------
#>   seqinfo: 13 sequences from an unspecified genome; no seqlengths

A given GRanges object with ranges of splice junctions can be converted into a vector of `junc_id``s:

junc_id <- as.character(gr)

head(junc_id)
#> [1] "chr2:28680064-28739675:+" "chr2:28739736-28741333:+"
#> [3] "chr2:28741399-28742572:+" "chr2:28742630-28748134:+"
#> [5] "chr2:28748174-28748772:+" "chr2:28748812-28752184:+"

Annotation of RNA-seq derived splice junctions

Splice2neo transforms splice events identified in RNA-seq into a standardized junction format. We show here how to parse the tools LeafCutter and SplAdder.

LeafCutter

The output of the tool LeafCutter (Li et al. 2018) can be transformed using leafcutter_transform():

leafcutter_path <- system.file("extdata/example/leafcutter", package="splice2neo")

leafcutter <-
  leafcutter_transform(path = leafcutter_path)
#> Importing Leafcutter files...
#> Importing Leafcutter files completed


head(leafcutter)
#> # A tibble: 6 × 7
#>   junction_start junction_end strand chromosome Gene  class junc_id             
#>   <chr>          <chr>        <chr>  <chr>      <lgl> <lgl> <chr>               
#> 1 39934433       39937097     -      chrX       NA    NA    chrX:39934433-39937…
#> 2 76944420       76953071     -      chrX       NA    NA    chrX:76944420-76953…
#> 3 18026082       18029488     -      chr11      NA    NA    chr11:18026082-1802…
#> 4 20921104       20922809     +      chr22      NA    NA    chr22:20921104-2092…
#> 5 153333024      153456093    -      chr4       NA    NA    chr4:153333024-1534…
#> 6 32084907       32085100     -      chr6       NA    NA    chr6:32084907-32085…

SplAdder

The output of the tool SplAdder (Kahles et al. 2016) can be transformed using spladder_transform():

spladder_path <- system.file("extdata/example/spladder", package="splice2neo")

spladder <-
  spladder_transform(path = spladder_path)
#> Importing Spladder files ...
#> Importing Spladder files completed


head(spladder)
#> # A tibble: 6 × 8
#>   junction_start junction_end strand chromosome Gene               class junc_id
#>   <chr>          <chr>        <chr>  <chr>      <chr>              <chr> <chr>  
#> 1 20921104       20922808     +      chr22      ENSG00000099917.1… A3SS  chr22:…
#> 2 20921104       20922809     +      chr22      ENSG00000099917.1… A3SS  chr22:…
#> 3 41108521       41109155     -      chr17      ENSG00000266967.7… A3SS  chr17:…
#> 4 41108577       41109155     -      chr17      ENSG00000266967.7… A3SS  chr17:…
#> 5 148881736      148884821    +      chr3       ENSG00000163755.9… A3SS  chr3:1…
#> 6 148881736      148884852    +      chr3       ENSG00000163755.9… A3SS  chr3:1…
#> # ℹ 1 more variable: number_supporting_reads <dbl>

Combine RNA-seq derived splice junctions

Now we want to get a single data.frame of all splice junctions identified by RNA-seq tools and information by which tool they were identified.

rna_junctions <-
  generate_combined_dataset(list("spladder_juncs" = spladder, "leafcutter_juncs" = leafcutter))

head(rna_junctions)
#> # A tibble: 6 × 10
#>   junc_id   junction_start junction_end strand chromosome identified_by_leafcu…¹
#>   <chr>     <chr>          <chr>        <chr>  <chr>      <lgl>                 
#> 1 chr1:266… 26627515       26628185     -      chr1       TRUE                  
#> 2 chr1:849… 84964231       84971694     -      chr1       TRUE                  
#> 3 chr10:82… 82267130       82269057     +      chr10      FALSE                 
#> 4 chr10:82… 82267130       82271900     +      chr10      FALSE                 
#> 5 chr10:82… 82269227       82271900     +      chr10      FALSE                 
#> 6 chr11:18… 18026082       18029488     -      chr11      TRUE                  
#> # ℹ abbreviated name: ¹​identified_by_leafcutter_juncs
#> # ℹ 4 more variables: identified_by_spladder_juncs <lgl>, Gene <chr>,
#> #   class <chr>, spladder_juncs_number_supporting_reads <dbl>

Annotation of mutation effect on splicing

Deep learning tools can predict the effect of a mutation on the disruption or creation of canonical splicing sequence motifs. This is typically provided as gain or loss of splicing acceptor or donor sites. Splice2neo can transform and annotate such mutation effects into a unified splice junction format. Currently, the transformation of predictions by SpliceAI (Jaganathan et al. 2019), MMSplice(Cheng et al. 2019) and Pangolin(Zeng and Li 2022) are supported.
Here, we show how to annotate results from SpliceAI and MMSplice. The annotation of Pangolin results works analogously with SpliceAI.

SpliceAI

Effect predictions by SpliceAI can be read with parse_spliceai()

spliceai_file <- system.file("extdata/example/spliceAI.vcf", package="splice2neo")

# parse SpliceAI results 
spliceai <-
  parse_spliceai(spliceai_file)

head(spliceai)
#> # A tibble: 6 × 18
#>   CHROM POS      ID    REF   ALT   QUAL  FILTER   Key ALLELE SYMBOL  DS_AG DS_AL
#>   <chr> <chr>    <chr> <chr> <chr> <chr> <chr>  <int> <chr>  <chr>   <dbl> <dbl>
#> 1 chr10 8006689  NA    C     T     NA    PASS       1 T      TAF3     0     0   
#> 2 chr10 91528535 NA    G     A     NA    PASS       2 A      KIF20B   0.64  0.22
#> 3 chr11 18028133 NA    C     T     NA    PASS       3 T      SERGEF   0     0   
#> 4 chr11 62985208 NA    C     T     NA    PASS       4 T      SLC22A…  0.06  0.55
#> 5 chr16 27709648 NA    G     A     NA    PASS       5 A      KIAA05…  0.97  1   
#> 6 chr16 30020302 NA    G     A     NA    PASS       6 A      DOC2A    0     0   
#> # ℹ 6 more variables: DS_DG <dbl>, DS_DL <dbl>, DP_AG <int>, DP_AL <int>,
#> #   DP_DG <int>, DP_DL <int>

Next, SpliceAi effects can be transformed into a standardized effect dataframe with format_spliceai(). Here, we can optionally provide a data.frame (gene_table) which relates gene symbols with GENCODE gene ids. If gene_table is provided, the formatted data.frame contains the GENCODE gene ids. While this is advised to better select relevant transcripts, it is not required.

gene_table <-
  readr::read_tsv(system.file("extdata/example/gene_table.tsv", package =
                                "splice2neo"))
#> Rows: 17 Columns: 3
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (3): strand, gene_id, gene_name
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(gene_table)
#> # A tibble: 6 × 3
#>   strand gene_id              gene_name
#>   <chr>  <chr>                <chr>    
#> 1 -      ENSG00000134313.16_8 KIDINS220
#> 2 +      ENSG00000163803.13_8 PLB1     
#> 3 -      ENSG00000163510.14_5 CWC22    
#> 4 -      ENSG00000136237.19_8 RAPGEF5  
#> 5 +      ENSG00000147316.13_4 MCPH1    
#> 6 +      ENSG00000138182.14_3 KIF20B
# format SpliceAI results
splicai_formatted <-
  format_spliceai(spliceai, gene_table = gene_table)

head(splicai_formatted)
#> # A tibble: 6 × 7
#>   mut_id             effect score chr   pos_rel      pos gene_id             
#>   <chr>              <fct>  <dbl> <chr>   <int>    <int> <chr>               
#> 1 chr10_91528535_G_A AG      0.64 chr10       2 91528537 ENSG00000138182.14_3
#> 2 chr10_91528535_G_A AL      0.22 chr10     -40 91528495 ENSG00000138182.14_3
#> 3 chr11_18028133_C_T DG      0.09 chr11      42 18028175 ENSG00000129158.11_6
#> 4 chr11_18028133_C_T DL      0.15 chr11       5 18028138 ENSG00000129158.11_6
#> 5 chr11_62985208_C_T AG      0.06 chr11      -2 62985206 ENSG00000196600.12_6
#> 6 chr11_62985208_C_T AL      0.55 chr11      -1 62985207 ENSG00000196600.12_6

Predicted SpliceAI effects can be annotated with all potential resulting junctions and transcripts with annotate_mut_effect(). If the input was formatted using gene_table, the user may want to choose gene_mapping = TRUE which removes unlikely events from the results. Otherwise, the default gene_mapping = FALSE should be used. The resulting dataframe contains all potential alternative transcripts per row, meaning that there can be multiple annotated transcripts per junction and also multiple junctions per mutation effect. At this point, no filtering based on effect size has been applied meaning that the dataframe contains unlikely events.

# annotate SpliceAI junctions with all potential junctions and transcripts
spliceai_annotated <-
  annotate_mut_effect(
    effect_df = splicai_formatted,
    transcripts = transcripts,
    transcripts_gr = transcripts_gr,
    gene_mapping = TRUE
  )
#> INFO: calculate coordinates of upstream and downstream exons...
#> INFO: get overlapping transcripts...
#> INFO: Build df ...
#> INFO: build pairs...
#> INFO: find overlapping exons...
#> INFO: calculate junction coordinates from predicted effect...
#> INFO: Evaluation of rules done.


spliceai_annotated %>% 
  dplyr::select(mut_id, junc_id, gene_id, tx_id, effect, score) %>% 
  head()
#> # A tibble: 6 × 6
#>   mut_id             junc_id                   gene_id        tx_id effect score
#>   <chr>              <chr>                     <chr>          <chr> <chr>  <dbl>
#> 1 chr10_91528535_G_A chr10:91528148-91528537:+ ENSG000001381… ENST… AG      0.64
#> 2 chr10_91528535_G_A chr10:91528148-91528537:+ ENSG000001381… ENST… AG      0.64
#> 3 chr10_91528535_G_A chr10:91528148-91528537:+ ENSG000001381… ENST… AG      0.64
#> 4 chr10_91528535_G_A chr10:91528494-91528495:+ ENSG000001381… ENST… AL      0.22
#> 5 chr10_91528535_G_A chr10:91528148-91532446:+ ENSG000001381… ENST… AL      0.22
#> 6 chr10_91528535_G_A chr10:91528494-91528495:+ ENSG000001381… ENST… AL      0.22

It can happen that multiple effects lead to the same altered transcripts (defined by mut_id, junc_id, tx_id ). The function unique_mut_junc() can be used to filter for unique altered transcripts by keeping the effect with the highest effect score.

# unique altered transcripts
spliceai_annotated_unique <- unique_mut_junc(spliceai_annotated)

nrow(spliceai_annotated_unique) < nrow(spliceai_annotated)
#> [1] TRUE

MMSplice

We can read the effect predictions by MMSplice with parse_mmsplice()

mmsplice_file <- system.file("extdata/example/MMsplice.csv", package="splice2neo")

# parse MMSplice results 
mmsplice <-
  parse_mmsplice(mmsplice_file)

head(mmsplice)
#> # A tibble: 6 × 20
#>   ID               exons exon_id gene_id gene_name transcript_id delta_logit_psi
#>   <chr>            <chr> <chr>   <chr>   <chr>     <chr>                   <dbl>
#> 1 chr2:28764482:C… chr2… ENSE00… ENSG00… PLB1      ENST00000327…         0.00444
#> 2 chr2:28764482:C… chr2… ENSE00… ENSG00… PLB1      ENST00000422…         0.00444
#> 3 chr2:28764482:C… chr2… ENSE00… ENSG00… PLB1      ENST00000404…         0.00444
#> 4 chr2:97312723:C… chr2… ENSE00… ENSG00… FER1L5    ENST00000624…         0.0464 
#> 5 chr2:97312723:C… chr2… ENSE00… ENSG00… FER1L5    ENST00000436…         0.0464 
#> 6 chr2:97312723:C… chr2… ENSE00… ENSG00… FER1L5    ENST00000641…         0.0464 
#> # ℹ 13 more variables: ref_acceptorIntron <dbl>, ref_acceptor <dbl>,
#> #   ref_exon <dbl>, ref_donor <dbl>, ref_donorIntron <dbl>,
#> #   alt_acceptorIntron <dbl>, alt_acceptor <dbl>, alt_exon <dbl>,
#> #   alt_donor <dbl>, alt_donorIntron <dbl>, pathogenicity <dbl>,
#> #   efficiency <dbl>, mut_id <chr>

The MMSplice data can be annotated with all potential splice junctions using annotate_mmsplice()

mmsplice_annotated <- mmsplice %>%
  annotate_mmsplice(transcripts = transcripts)

head(mmsplice_annotated)
#> # A tibble: 6 × 22
#>   ID    exons exon_id gene_id gene_name tx_id delta_logit_psi ref_acceptorIntron
#>   <chr> <chr> <chr>   <chr>   <chr>     <chr>           <dbl>              <dbl>
#> 1 chr2… chr2… ENSE00… ENSG00… PLB1      ENST…         0.00444              -4.31
#> 2 chr2… chr2… ENSE00… ENSG00… PLB1      ENST…         0.00444              -4.31
#> 3 chr2… chr2… ENSE00… ENSG00… PLB1      ENST…         0.00444              -4.31
#> 4 chr2… chr2… ENSE00… ENSG00… PLB1      ENST…         0.00444              -4.31
#> 5 chr2… chr2… ENSE00… ENSG00… PLB1      ENST…         0.00444              -4.31
#> 6 chr2… chr2… ENSE00… ENSG00… PLB1      ENST…         0.00444              -4.31
#> # ℹ 14 more variables: ref_acceptor <dbl>, ref_exon <dbl>, ref_donor <dbl>,
#> #   ref_donorIntron <dbl>, alt_acceptorIntron <dbl>, alt_acceptor <dbl>,
#> #   alt_exon <dbl>, alt_donor <dbl>, alt_donorIntron <dbl>,
#> #   pathogenicity <dbl>, efficiency <dbl>, mut_id <chr>, event_type <chr>,
#> #   junc_id <chr>

Junctions predicted from MMSplice results can contain duplicated altered transcripts for exon inclusion events. We use the function unique_junc_mmsplice() to filter for unique altered transcripts by keeping the effect with the highest effect score.

# unique altered transcripts
mmsplice_annotated_unique <- unique_junc_mmsplice(mmsplice_annotated)

nrow(mmsplice_annotated_unique) < nrow(mmsplice_annotated)
#> [1] FALSE

Combine mutation-retrieved splice junctions

Next, we combine splice junctions from SpliceAI and MMSplice into one dataframe using combine_mut_junc(). This dataframe will contain unique altered transcripts (defined by mut_id, junc_id, tx_id) in rows. Tool specific information are defined in columns starting with the respective tool name.

mut_junctions <- combine_mut_junc(list(
    "spliceai" = spliceai_annotated_unique, 
    "mmsplice" = mmsplice_annotated_unique
    ))


mut_junctions %>%
  dplyr::select(mut_id, junc_id, tx_id, spliceai_event_type, spliceai_score) %>% 
  head()
#> # A tibble: 6 × 5
#>   mut_id             junc_id            tx_id spliceai_event_type spliceai_score
#>   <chr>              <chr>              <chr> <chr>                        <dbl>
#> 1 chr10_8006689_C_T  chr10:7866523-801… ENST… NA                           NA   
#> 2 chr10_91528535_G_A chr10:91528148-91… ENST… alternative 3prim             0.64
#> 3 chr10_91528535_G_A chr10:91528148-91… ENST… exon skipping                 0.22
#> 4 chr10_91528535_G_A chr10:91528494-91… ENST… intron retention              0.22
#> 5 chr10_91528535_G_A chr10:91528148-91… ENST… alternative 3prim             0.64
#> 6 chr10_91528535_G_A chr10:91528148-91… ENST… exon skipping                 0.22

Integrative analysis of mutation-retrieved splice junctions

Annotate canonical junctions

To exclude junctions expressed in healthy tissues, we here want to filter predicted junctions for non-canonical splice junctions. We can use for instance exon-exon junctions from GENCODE and junctions found in healthy tisssue samples. Splice2neo also provides the function bed_to_junc() which allows to transform a bed file into a list of splice junctions.

# this is just an example file with a few canonical junctions --> incomplete
canonical_juncs_file <- system.file("extdata/example/test_canonical.bed", package="splice2neo")

canonical_junctios <- bed_to_junc(bed_file = canonical_juncs_file, type = "exon-exon")

Annotate mutation-retrieved splice junction whether they are canonical splice junctions.

# annotated canonical junctions
mut_junctions <- mut_junctions %>%
  mutate(is_canonical = is_canonical(junc_id, ref_junc = canonical_junctios, exons_gr = transcripts))

mut_junctions %>% 
  count(is_canonical)
#> # A tibble: 2 × 2
#>   is_canonical     n
#>   <lgl>        <int>
#> 1 FALSE          202
#> 2 TRUE           124

# remove canonical junctions 
noncan_mut_junctions <- mut_junctions %>% 
  filter(!is_canonical)

Annotate support in RNA-seq derived splice junctions

Annotate whether mutation-retrieved splice junction were also detected by a RNA-seq tool.

noncan_mut_junctions <- noncan_mut_junctions %>%
  mutate(is_in_rnaseq = is_in_rnaseq(junc_id, rna_juncs = rna_junctions$junc_id))

Annotate transcript context sequence

Splice2neo can annotate the resulting transcript sequence for each altered transcript with add_context_seq(). Here, we set the size parameter to 400 to extract the sequence of +/-200 nucleotides around the splice junctions as context sequence.

noncan_mut_junctions <- noncan_mut_junctions %>%
  add_context_seq(
    size = 400,
    bsg = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
    transcripts = transcripts
  )

noncan_mut_junctions %>%
  dplyr::select(mut_id, junc_id, tx_id, cts_id, cts_seq) %>%
  head()
#> # A tibble: 6 × 5
#>   mut_id             junc_id                   tx_id              cts_id cts_seq
#>   <chr>              <chr>                     <chr>              <chr>  <chr>  
#> 1 chr10_8006689_C_T  chr10:7866523-8019204:+   ENST00000344293.6… dd63d… CCAGCT…
#> 2 chr10_91528535_G_A chr10:91528148-91528537:+ ENST00000260753.8… 83411… AGCCAA…
#> 3 chr10_91528535_G_A chr10:91528148-91532446:+ ENST00000260753.8… f7ae9… AGCCAA…
#> 4 chr10_91528535_G_A chr10:91528494-91528495:+ ENST00000260753.8… 30ff5… AGCCAA…
#> 5 chr10_91528535_G_A chr10:91528148-91528537:+ ENST00000371728.7… 83411… AGCCAA…
#> 6 chr10_91528535_G_A chr10:91528148-91532446:+ ENST00000371728.7… f7ae9… AGCCAA…

Annotate peptide context sequence

Splice2neo can annotate the resulting peptide sequence for each altered transcript with add_peptide(). This is relevant to analyze which altered transcript could qualify as neoantigens. We can define by the parameter flanking_size the number of wild-type amino acids flanking the junction or novel sequence caused by the splice junction to the left and to the right. If the junction leads to a reading frame shift, sequences are flanked by wild-type amino acids to the left and are annotated until the first stop codon.

noncan_mut_junctions <- noncan_mut_junctions %>%
  add_peptide(flanking_size = 13,
              bsg = BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19,
              cds = cds)

noncan_mut_junctions %>%
  dplyr::select(mut_id,
                junc_id,
                tx_id,
                junc_in_orf,
                peptide_context,
                frame_shift) %>%
  head()
#> # A tibble: 6 × 6
#>   mut_id             junc_id       tx_id junc_in_orf peptide_context frame_shift
#>   <chr>              <chr>         <chr> <lgl>       <chr>           <lgl>      
#> 1 chr10_8006689_C_T  chr10:786652… ENST… TRUE        IPDYLPPIVSSQED… TRUE       
#> 2 chr10_91528535_G_A chr10:915281… ENST… TRUE        SDDRNSSVKKEQKS… FALSE      
#> 3 chr10_91528535_G_A chr10:915281… ENST… TRUE        SDDRNSSVKKEQKQ… TRUE       
#> 4 chr10_91528535_G_A chr10:915284… ENST… TRUE        SDDRNSSVKKEQKV… TRUE       
#> 5 chr10_91528535_G_A chr10:915281… ENST… TRUE        SDDRNSSVKKEQKS… FALSE      
#> 6 chr10_91528535_G_A chr10:915281… ENST… TRUE        SDDRNSSVKKEQKQ… TRUE

Annotate if predicted retained introns are covered by an exon of another transcript

Relevant intron retention events are particular challenging to predict. Splice2neo provides the function exon_in_intron() to annotate whether the exon of another transcript is in the genomic region of a predicted retained introns. We do not want to consider them in the downstream analysis as it is difficult to differentiate if the read support from RNA-seq relates to the retained intron or to the exon of the other transcript.

# annotate exons in introns of other transcript
noncan_mut_junctions <-
  noncan_mut_junctions %>%
  exon_in_intron(transcripts = transcripts)

Re-quantification of altered transcripts with EasyQuant

Altered transcript from predicted splice junctions can be re-quantified in RNA-seq in a targeted manner with EasyQuant (available at https://github.com/TRON-Bioinformatics/easyquant). EasyQuant maps all RNA-seq reads to to a custom reference build from the context sequence around splice junctions. We can generate the input for EasyQuant using the splice2neo function transform_for_requant().
EasyQuant should be run in interval mode for the re-quantification of splice junction and it is advised to set the BP_distannce parameter to 10 (see EasyQuant README).

# transform for EasyQuant
easyquant_input <- noncan_mut_junctions %>%
  transform_for_requant()

head(easyquant_input)
#> # A tibble: 6 × 3
#>   name                             sequence                             position
#>   <chr>                            <chr>                                <chr>   
#> 1 dd63de34c46767f447b72d5ab223df47 CCAGCTGATGGGGGTTAGTCTACATGAACTAGAAG… 200     
#> 2 834110bae711a18586e9745ce37e6d33 AGCCAAACAAAATGGCAGTGAAACACCCTGGTTGT… 200     
#> 3 f7ae934315938fe4c3b41bc44b74304c AGCCAAACAAAATGGCAGTGAAACACCCTGGTTGT… 200     
#> 4 30ff58139be454c8ca52c4a28bf8a000 AGCCAAACAAAATGGCAGTGAAACACCCTGGTTGT… 0,200,5…
#> 5 7d57cc792100728dd7c9d6cb03f0bae2 CCCAGCAACTGAATGACTTCTGTAAACCCAGGAGT… 200     
#> 6 29bc2a21dd2b55a13ddc993fc8be5086 TCGCATGGAGCGCGAGCCCAGCGCCTCGGAGGCCG… 200

After running EasyQuant (which is not shown here), we can import the EasyQuant results with the splice2neo function map_requant().

# path to easyquant folder 
easyquant_path <- system.file("extdata/example/easyquant", package="splice2neo")

# merge easyquant results 
requant_noncan_mut_junctions <-
  map_requant(path_to_easyquant_folder = easyquant_path,
              junc_tib = noncan_mut_junctions)
#> Rows: 261 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): name, interval
#> dbl (6): overlap_interval_end_reads, span_interval_end_pairs, within_interva...
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.

Here, we are interested in different result columns depending on the type of splice event. In case of alternative splice sites or exon skipping events, want use the number of junction reads that map on the splice junction (junc_interval_start) and/or the number of spanning reads that bridge the splice junction (span_interval_start).

# re-quantification results
requant_noncan_mut_junctions %>% 
  filter(spliceai_event_type != "intron retention") %>% 
  dplyr::select(mut_id, junc_id, tx_id, junc_interval_start, span_interval_start) %>% 
  head()
#> # A tibble: 6 × 5
#>   mut_id             junc_id       tx_id junc_interval_start span_interval_start
#>   <chr>              <chr>         <chr>               <dbl>               <dbl>
#> 1 chr10_91528535_G_A chr10:915281… ENST…                   3                  10
#> 2 chr10_91528535_G_A chr10:915281… ENST…                   0                   9
#> 3 chr10_91528535_G_A chr10:915281… ENST…                   3                  10
#> 4 chr10_91528535_G_A chr10:915281… ENST…                   0                   9
#> 5 chr10_91528535_G_A chr10:915281… ENST…                   3                  10
#> 6 chr10_91528535_G_A chr10:915281… ENST…                   0                   9

In case of intron retentions, we may want to consider the following columns:

  • within_interval: Number of reads that map to the intron of interest
  • coverage_perc: Relative read coverage of the intron of interest
  • coverage_median: Median read coverage of the intron of interest
  • coverage_mean: Mean read coverage of the intron of interest. This value can be misleading by skewed read distribution and the user may rather want to use the median coverage
# re-quantification results
requant_noncan_mut_junctions %>%
  filter(spliceai_event_type == "intron retention") %>%
  dplyr::select(
    mut_id,
    junc_id,
    tx_id,
    within_interval,
    coverage_perc,
    coverage_median,
    coverage_mean
  ) %>%
  head()
#> # A tibble: 6 × 7
#>   mut_id             junc_id tx_id within_interval coverage_perc coverage_median
#>   <chr>              <chr>   <chr>           <dbl>         <dbl>           <dbl>
#> 1 chr10_91528535_G_A chr10:… ENST…               1         0.543               1
#> 2 chr10_91528535_G_A chr10:… ENST…               1         0.543               1
#> 3 chr10_91528535_G_A chr10:… ENST…               1         0.543               1
#> 4 chr11_18028133_C_T chr11:… ENST…              85         0.937               4
#> 5 chr11_18028133_C_T chr11:… ENST…              85         0.937               4
#> 6 chr11_18028133_C_T chr11:… ENST…              85         0.937               4
#> # ℹ 1 more variable: coverage_mean <dbl>
# alternative to only import EasyQuant results without merging 

requantification_results  <-
  read_requant(path_folder = easyquant_path)
#> Rows: 261 Columns: 8
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (2): name, interval
#> dbl (6): overlap_interval_end_reads, span_interval_end_pairs, within_interva...
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(requantification_results)
#> # A tibble: 6 × 20
#>   name                 junc_interval_start junc_interval_end span_interval_start
#>   <chr>                              <dbl>             <dbl>               <dbl>
#> 1 5846baef085e1cb83c8…                   0                 1                   8
#> 2 e0ab8366d4f7014c521…                   0                 1                   9
#> 3 fcc744d7184c3bdae26…                   1                 0                   7
#> 4 13490cdd721e01c1334…                   0                 1                   3
#> 5 3f5a72af0853d8b59e6…                   2                 0                   6
#> 6 48ba45ad78650a6d3b6…                   2                 0                   3
#> # ℹ 16 more variables: span_interval_end <dbl>, within_interval <dbl>,
#> #   within_interval_left <dbl>, within_interval_right <dbl>,
#> #   coverage_perc <dbl>, coverage_perc_left <dbl>, coverage_perc_right <dbl>,
#> #   coverage_mean <dbl>, coverage_mean_left <dbl>, coverage_mean_right <dbl>,
#> #   coverage_median <dbl>, coverage_median_left <dbl>,
#> #   coverage_median_right <dbl>, interval <chr>, interval_left <chr>,
#> #   interval_right <chr>

Prediction of mutation-associated splice targets

In the analysis described above we identified mutation-retrieved splice junctions supported by RNA-seq. We have developed a stringent detection rule to predict tumor-specific targets from mutation-associated splice targets based on the mutation effect and the RNA-seq support (See Lang et al. (2023)). Currently, we suggest to focus on events from alternative splice sites and exon skipping and define such splice junctions as targets that have a |splice effect score| \(\geq\) 0.35 and are found by LeafCutter/SplAdder or have at least 3 junction reads.

# exclude intron retention
potential_targets <- requant_noncan_mut_junctions %>%
  filter(spliceai_event_type != "intron retention")

potential_targets %>% 
  select(junc_id, tx_id, spliceai_score, mmsplice_delta_logit_psi, is_in_rnaseq, junc_interval_start)
#> # A tibble: 116 × 6
#>    junc_id              tx_id spliceai_score mmsplice_delta_logit…¹ is_in_rnaseq
#>    <chr>                <chr>          <dbl>                  <dbl> <lgl>       
#>  1 chr10:91528148-9152… ENST…           0.64                 NA     FALSE       
#>  2 chr10:91528148-9153… ENST…           0.22                 -0.267 FALSE       
#>  3 chr10:91528148-9152… ENST…           0.64                 NA     FALSE       
#>  4 chr10:91528148-9153… ENST…           0.22                 -0.267 FALSE       
#>  5 chr10:91528148-9152… ENST…           0.64                 NA     FALSE       
#>  6 chr10:91528148-9153… ENST…           0.22                 -0.267 FALSE       
#>  7 chr11:18026082-1802… ENST…           0.09                 NA     FALSE       
#>  8 chr11:18026082-1802… ENST…           0.15                 -2.50  TRUE        
#>  9 chr11:18026082-1802… ENST…           0.09                 NA     FALSE       
#> 10 chr11:18026082-1802… ENST…           0.15                 -2.50  TRUE        
#> # ℹ 106 more rows
#> # ℹ abbreviated name: ¹​mmsplice_delta_logit_psi
#> # ℹ 1 more variable: junc_interval_start <dbl>
# apply detection rule 
targets <- potential_targets %>% 
  filter(spliceai_score >= 0.35 | mmsplice_delta_logit_psi < -0.35) %>% 
  filter(is_in_rnaseq | junc_interval_start >= 3)

The data in targets can contain several altered transcripts per junctions which may lead to several neoantigen candidate sequences per target splice junction.

# apply detection rule 
targets %>% 
  distinct(junc_id, peptide_context) %>% 
  count(junc_id, name = "# neoantigen candidates") %>% 
  arrange(-`# neoantigen candidates`)
#> # A tibble: 7 × 2
#>   junc_id                   `# neoantigen candidates`
#>   <chr>                                         <int>
#> 1 chr22:20921104-20922809:+                         3
#> 2 chr10:91528148-91528537:+                         2
#> 3 chr11:18026082-18029488:-                         2
#> 4 chr16:27692851-27709650:+                         2
#> 5 chr2:8910961-8916878:-                            2
#> 6 chrX:39934433-39937097:-                          2
#> 7 chrX:76944420-76953071:-                          1

Other Splice2neo functions

Annotation of transcipts

Here, we have a list of splice junctions and want to know all potential affected transcripts.

Given a data.frame of splice junctions (with a column junc_id), the function add_tx() annotates all possible transcripts which overlap with the splice junctions.

junc_df <- dplyr::tibble(
  junc_id = toy_junc_id[c(1, 6, 10)]
)


junc_df <- junc_df %>% 
  add_tx(toy_transcripts)

junc_df
#> # A tibble: 21 × 3
#>    junc_id                    tx_id           tx_lst      
#>    <chr>                      <chr>           <named list>
#>  1 chr2:152389996-152392205:- ENST00000409198 <GRanges>   
#>  2 chr2:152389996-152392205:- ENST00000172853 <GRanges>   
#>  3 chr2:152389996-152392205:- ENST00000397345 <GRanges>   
#>  4 chr2:152389996-152392205:- ENST00000427231 <GRanges>   
#>  5 chr2:152389996-152392205:- ENST00000618972 <GRanges>   
#>  6 chr2:152389996-152392205:- ENST00000413693 <GRanges>   
#>  7 chr2:152389996-152392205:- ENST00000603639 <GRanges>   
#>  8 chr2:152389996-152392205:- ENST00000604864 <GRanges>   
#>  9 chr2:152389996-152392205:- ENST00000420924 <GRanges>   
#> 10 chr2:179415981-179416357:- ENST00000342992 <GRanges>   
#> # ℹ 11 more rows

This can lead to large data sets containing many highly unlikely junction transcript combinations. We can select a subset of transcripts per junction that are more likely to be affected by a junction with choose_tx(). Please note that this function may loose relevant or keep irrelevant junction-transcripts in particular in regions with multiple isoforms with distinct splicing pattern.

selected_junc_df <- junc_df %>% 
  choose_tx()

selected_junc_df
#> # A tibble: 21 × 4
#>    junc_id                    tx_id           tx_lst       putative_event_type
#>    <chr>                      <chr>           <named list> <chr>              
#>  1 chr2:152389996-152392205:- ENST00000409198 <GRanges>    ASS                
#>  2 chr2:152389996-152392205:- ENST00000172853 <GRanges>    ASS                
#>  3 chr2:152389996-152392205:- ENST00000397345 <GRanges>    ASS                
#>  4 chr2:152389996-152392205:- ENST00000427231 <GRanges>    ASS                
#>  5 chr2:152389996-152392205:- ENST00000618972 <GRanges>    ASS                
#>  6 chr2:152389996-152392205:- ENST00000413693 <GRanges>    ASS                
#>  7 chr2:152389996-152392205:- ENST00000603639 <GRanges>    ASS                
#>  8 chr2:152389996-152392205:- ENST00000604864 <GRanges>    ASS                
#>  9 chr2:152389996-152392205:- ENST00000420924 <GRanges>    ASS                
#> 10 chr2:179415981-179416357:- ENST00000342992 <GRanges>    ASS                
#> # ℹ 11 more rows

References

Cheng, Jun, Thi Yen Duong Nguyen, Kamil J. Cygan, Muhammed Hasan Çelik, William G. Fairbrother, žiga Avsec, and Julien Gagneur. 2019. MMSplice: Modular Modeling Improves the Predictions of Genetic Variant Effects on Splicing.” Genome Biology 20 (1): 48. https://doi.org/10.1186/s13059-019-1653-z.
Jaganathan, Kishore, Sofia Kyriazopoulou Panagiotopoulou, Jeremy F. McRae, Siavash Fazel Darbandi, David Knowles, Yang I. Li, Jack A. Kosmicki, et al. 2019. “Predicting Splicing from Primary Sequence with Deep Learning.” Cell 176 (3): 535–548.e24. https://doi.org/10.1016/j.cell.2018.12.015.
Kahles, André, Cheng Soon Ong, Yi Zhong, and Gunnar Rätsch. 2016. SplAdder: Identification, Quantification and Testing of Alternative Splicing Events from RNA-Seq Data.” Bioinformatics 32 (12): 1840–47. https://doi.org/10.1093/bioinformatics/btw076.
Lang, Franziska, Barbara Schrörs, Martin Löwer, Özlem Türeci, and Ugur Sahin. 2022. “Identification of Neoantigens for Individualized Therapeutic Cancer Vaccines.” Nature Reviews Drug Discovery, February, 1–22. https://doi.org/10.1038/s41573-021-00387-y.
Lang, Franziska, Patrick Sorn, Martin Suchan, Alina Henrich, Christian Albrecht, Nina Koehl, Aline Beicht, et al. 2023. “Prediction of Tumor-Specific Splicing from Somatic Mutations as a Source of Neoantigen Candidates.” bioRxiv. https://doi.org/10.1101/2023.06.27.546494.
Li, Yang I., David A. Knowles, Jack Humphrey, Alvaro N. Barbeira, Scott P. Dickinson, Hae Kyung Im, and Jonathan K. Pritchard. 2018. “Annotation-Free Quantification of RNA Splicing Using LeafCutter.” Nature Genetics 50 (1): 151–58. https://doi.org/10.1038/s41588-017-0004-9.
Zeng, Tony, and Yang I. Li. 2022. “Predicting RNA Splicing from DNA Sequence Using Pangolin.” Genome Biology 23 (1): 103. https://doi.org/10.1186/s13059-022-02664-4.

Session info

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4      splice2neo_0.6.9
#> 
#> loaded via a namespace (and not attached):
#>   [1] DBI_1.2.3                         bitops_1.0-7                     
#>   [3] httr2_1.0.1                       permute_0.9-7                    
#>   [5] biomaRt_2.60.1                    rlang_1.1.4                      
#>   [7] magrittr_2.0.3                    vcfR_1.15.0                      
#>   [9] matrixStats_1.3.0                 compiler_4.4.1                   
#>  [11] RSQLite_2.3.7                     mgcv_1.9-1                       
#>  [13] GenomicFeatures_1.56.0            png_0.1-8                        
#>  [15] systemfonts_1.1.0                 vctrs_0.6.5                      
#>  [17] txdbmaker_1.0.1                   stringr_1.5.1                    
#>  [19] memuse_4.2-3                      pkgconfig_2.0.3                  
#>  [21] crayon_1.5.3                      fastmap_1.2.0                    
#>  [23] dbplyr_2.5.0                      XVector_0.44.0                   
#>  [25] utf8_1.2.4                        Rsamtools_2.20.0                 
#>  [27] rmarkdown_2.27                    tzdb_0.4.0                       
#>  [29] UCSC.utils_1.0.0                  ragg_1.3.2                       
#>  [31] purrr_1.0.2                       bit_4.0.5                        
#>  [33] xfun_0.45                         zlibbioc_1.50.0                  
#>  [35] cachem_1.1.0                      GenomeInfoDb_1.40.1              
#>  [37] jsonlite_1.8.8                    progress_1.2.3                   
#>  [39] blob_1.2.4                        DelayedArray_0.30.1              
#>  [41] BiocParallel_1.38.0               cluster_2.1.6                    
#>  [43] parallel_4.4.1                    prettyunits_1.2.0                
#>  [45] R6_2.5.1                          bslib_0.7.0                      
#>  [47] stringi_1.8.4                     rtracklayer_1.64.0               
#>  [49] GenomicRanges_1.56.1              jquerylib_0.1.4                  
#>  [51] Rcpp_1.0.12                       SummarizedExperiment_1.34.0      
#>  [53] knitr_1.47                        readr_2.1.5                      
#>  [55] IRanges_2.38.0                    splines_4.4.1                    
#>  [57] Matrix_1.7-0                      tidyselect_1.2.1                 
#>  [59] abind_1.4-5                       yaml_2.3.8                       
#>  [61] vegan_2.6-6.1                     codetools_0.2-20                 
#>  [63] curl_5.2.1                        lattice_0.22-6                   
#>  [65] tibble_3.2.1                      Biobase_2.64.0                   
#>  [67] withr_3.0.0                       KEGGREST_1.44.1                  
#>  [69] evaluate_0.24.0                   pinfsc50_1.3.0                   
#>  [71] desc_1.4.3                        BiocFileCache_2.12.0             
#>  [73] xml2_1.3.6                        Biostrings_2.72.1                
#>  [75] pillar_1.9.0                      filelock_1.0.3                   
#>  [77] MatrixGenerics_1.16.0             stats4_4.4.1                     
#>  [79] generics_0.1.3                    vroom_1.6.5                      
#>  [81] RCurl_1.98-1.14                   S4Vectors_0.42.0                 
#>  [83] hms_1.1.3                         glue_1.7.0                       
#>  [85] tools_4.4.1                       BiocIO_1.14.0                    
#>  [87] BSgenome_1.72.0                   GenomicAlignments_1.40.0         
#>  [89] fs_1.6.4                          XML_3.99-0.17                    
#>  [91] grid_4.4.1                        tidyr_1.3.1                      
#>  [93] ape_5.8                           AnnotationDbi_1.66.0             
#>  [95] nlme_3.1-164                      GenomeInfoDbData_1.2.12          
#>  [97] restfulr_0.0.15                   cli_3.6.3                        
#>  [99] rappdirs_0.3.3                    textshaping_0.4.0                
#> [101] fansi_1.0.6                       viridisLite_0.4.2                
#> [103] S4Arrays_1.4.1                    BSgenome.Hsapiens.UCSC.hg19_1.4.3
#> [105] sass_0.4.9                        digest_0.6.36                    
#> [107] BiocGenerics_0.50.0               SparseArray_1.4.8                
#> [109] rjson_0.2.21                      htmlwidgets_1.6.4                
#> [111] memoise_2.0.1                     htmltools_0.5.8.1                
#> [113] pkgdown_2.0.9                     lifecycle_1.0.4                  
#> [115] httr_1.4.7                        MASS_7.3-60.2                    
#> [117] bit64_4.0.5