Annotates splice junctions with resulting CDS and peptide sequence.

add_peptide(df, cds, flanking_size = 14, bsg = NULL, keep_ranges = FALSE)

Arguments

df

A data.frame with splice junctions in rows and at least the columns:

junc_id junction id consisting of genomic coordinates
tx_id the ID of the affected transcript/CDS (see add_tx)

cds

as a named GRangesList of coding sequences (CDS) ranges

flanking_size

the number of wild-type amino acids flanking the junction or novel sequence caused by the splice junction to the left and to the right. Frame shift junctions are flanked by wild-type amino acids to the left and are annotated until the first stop codon.

bsg

BSgenome object such as BSgenome.Hsapiens.UCSC.hg19

keep_ranges

Should GRanges of CDS and modified CDS be kept? If TRUE, the list columns cds_lst and cds_mod_lst are added to the output.

Value

A data.frame with the same rows as the input df but with the following additional column(s):

cds_mod_id An identifier made from tx_id and junc_id
junc_pos_cds the junction position in the modified CDS sequence
frame_shift Indicator whether junction leads to frame shift.
is_first_reading_frame Indicator whether modified CDS sequence is translated into protein sequence using the 1st reading frame (i.e. reading is starting from the first nucleotide) for in-frame peptides.
normalized_cds_junc_pos The normalized position of the junction in the modified CDS sequence to the left junction side.
protein The full protein sequence of the translated modified CDS.
normalized_protein_junc_pos The normalized position of the junction in the protein sequence to the left junction side.
peptide_context_junc_pos The junction position relative to the peptide_context sequence
junc_in_orf Indicator whether the junction is located in an open reading frame.
peptide_context_seq_raw The peptide sequence around the junction including stop codons.
peptide_context The peptide sequence around the junction truncated after stop codons.
truncated_cds Indicator whether the mutated gene product is a truncated from of the WT gene product. If TRUE, peptide_context = NA.
cds_description Descriptor of of the mutated gene product. Can be one of c("mutated cds", "truncated wt cds", "no mutated gene product", "no wt cds", "not in ORF")

If the keep_ranges is TRUE, the following additional columns are added to the output data.frame:

cds_lst A list of GRanges with the original CDS as provided in tx_id column and cds object..
cds_mod_lst A list of GRanges with the modified CDS ranges.

Examples

requireNamespace("BSgenome.Hsapiens.UCSC.hg19", quietly = TRUE)
bsg <- BSgenome.Hsapiens.UCSC.hg19::BSgenome.Hsapiens.UCSC.hg19

add_peptide(toy_junc_df, toy_cds, bsg = bsg, flanking_size = 13)
#> # A tibble: 17 × 18
#>    junc_id          tx_id protein protein_wt frame_shift cds_mod_id junc_pos_cds
#>    <chr>            <chr> <chr>   <chr>      <lgl>       <chr>             <int>
#>  1 chr2:152389996-… ENST… MADDED… MADDEDYEE… TRUE        ENST00000…        16209
#>  2 chr2:152389996-… ENST… MADDED… MADDEDYEE… TRUE        ENST00000…        16314
#>  3 chr2:152389955-… ENST… MADDED… MADDEDYEE… FALSE       ENST00000…        21417
#>  4 chr2:152388410-… ENST… MADDED… MADDEDYEE… FALSE       ENST00000…        16209
#>  5 chr2:152388410-… ENST… MADDED… MADDEDYEE… FALSE       ENST00000…        16314
#>  6 chr2:179415981-… ENST… MTTQAP… MTTQAPTFT… FALSE       ENST00000…        83566
#>  7 chr2:179415987-… ENST… MTTQAP… MTTQAPTFT… FALSE       ENST00000…        83935
#>  8 chr2:179415000-… ENST… MTTQAP… MTTQAPTFT… FALSE       ENST00000…        83566
#>  9 chr2:179445336-… ENST… MTTQAP… MTTQAPTFT… TRUE        ENST00000…        59084
#> 10 chr2:179446225-… ENST… MTTQAP… MTTQAPTFT… TRUE        ENST00000…        59065
#> 11 chr2:179445336-… ENST… MTTQAP… MTTQAPTFT… FALSE       ENST00000…        58759
#> 12 chr2:179642044-… ENST… MTTQAP… MTTQAPTFT… TRUE        ENST00000…         4605
#> 13 chr2:179642146-… ENST… MTTQAP… MTTQAPTFT… FALSE       ENST00000…         4645
#> 14 chr2:179642044-… ENST… MTTQAP… MTTQAPTFT… FALSE       ENST00000…         4480
#> 15 chr2:152226533-… ENST… NA      NA         NA          NA                   NA
#> 16 chr2:152222731-… ENST… NA      NA         NA          NA                   NA
#> 17 chr2:152388410-… ENST… MADDED… MADDEDYEE… FALSE       ENST00000…        22962
#> # ℹ 11 more variables: protein_junc_pos <dbl>, normalized_cds_junc_pos <int>,
#> #   normalized_protein_junc_pos_pre <dbl>, is_first_reading_frame <lgl>,
#> #   normalized_protein_junc_pos <dbl>, junc_in_orf <lgl>, truncated_cds <lgl>,
#> #   cds_description <chr>, peptide_context_seq_raw <chr>,
#> #   peptide_context_junc_pos <dbl>, peptide_context <chr>