Build Indices
The build indices subworkflow generates indices for the bioinformatics tools listed below, based on the previously pulled resources. A complete list of all supported bioinformatics tools can be found here.
Note: When using the generated indices, make sure that the tool versions used in your analysis match the versions that were used to create the indices. Mismatched versions may lead to errors or inconsistent results. The version of each tool can be found in the respective environment YAML file in
workflow/envs.
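One way to look up a pinned version is to search the environment files for the tool name. The following is a minimal sketch, assuming the files use conda-style dependency lines of the form "- <tool>=<version>"; the function name pinned_versions is illustrative and not part of the workflow.

```shell
# Print the dependency entries that pin <tool> in the environment files.
# Assumption: conda-style lines such as "- <tool>=<version>", optionally
# prefixed with a channel ("<channel>::<tool>=<version>").
pinned_versions() {
    local tool="$1" envs_dir="${2:-workflow/envs}"
    grep -rhE "^[[:space:]]*-[[:space:]]*([^[:space:]]+::)?${tool}[ =<>]" "$envs_dir" |
        sed 's/^[[:space:]]*-[[:space:]]*//'
}
```

Calling pinned_versions bowtie2 from the repository root prints the matching entries, which can then be compared with the version reported by the locally installed tool.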
Input
The build indices workflow runs after the pull resources workflow.
The output of the pull resources workflow must therefore be present in the
directory where the indices will be generated: the snakemake command line
parameter --directory has to point to a directory that was created with the
pull_resources workflow.
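The requirement above can be checked before starting a run. The sketch below only tests that a non-empty resources directory exists in the target directory; it does not validate individual files, and the function name has_pulled_resources is illustrative.

```shell
# Return 0 if <dir>/resources exists and contains at least one entry.
# This is a coarse check: it does not verify that pull_resources finished.
has_pulled_resources() {
    local outdir="$1"
    [ -d "$outdir/resources" ] && [ -n "$(ls -A "$outdir/resources" 2>/dev/null)" ]
}
```

For example, has_pulled_resources /path/to/output/directory || echo "run pull_resources first" >&2 stops early with a readable message.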
Usage
To run the build indices workflow, run the following command:
snakemake --until build_indices \
--directory </path/to/output/directory> \
--software-deployment-method [conda|apptainer] \
--latency-wait 60 \
[--configfile <path/to/config/file>] \
[--profile </path/to/cluster/profile/>]
- --directory: Specifies the directory where the results of the workflow should be stored.
- --software-deployment-method: Either conda or apptainer. Container images for apptainer are configured in config/container_config.yaml.
- --latency-wait: Wait for e.g. 60 seconds for files to be created due to IO latency.
- --configfile (optional): Defines e.g. the reference genome version that should be used, see Configuration.
- --profile (optional): Specifies a cluster profile to submit jobs, e.g. to an HPC cluster.
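The required parameters above can be wrapped in a small helper that rejects an unsupported deployment method before snakemake starts. This is a sketch only: the function name run_build_indices and the PRINT_ONLY switch are hypothetical and not part of the workflow, and the optional --configfile and --profile arguments are left out.

```shell
# Assemble the build_indices call from the required parameters.
# With PRINT_ONLY=1 the command is printed instead of executed.
run_build_indices() {
    local outdir="$1" method="${2:-conda}"
    case "$method" in
        conda|apptainer) ;;
        *) echo "unsupported deployment method: $method" >&2; return 1 ;;
    esac
    # Reuse the positional parameters as the argument list of the command.
    set -- snakemake --until build_indices \
        --directory "$outdir" \
        --software-deployment-method "$method" \
        --latency-wait 60
    if [ "${PRINT_ONLY:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Running PRINT_ONLY=1 run_build_indices /path/to/output/directory apptainer shows the exact command line for review before anything is executed.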
Output
The build_indices workflow creates the indices directory next to
the resources directory created by the pull_resources
workflow. The following directory structure is created:
</path/to/output/directory>
├── indices
│   ├── bowtie2
│   │   ├── genome.1.bt2
│   │   ├── genome.2.bt2
│   │   ├── genome.3.bt2
│   │   ├── genome.4.bt2
│   │   ├── genome.rev.1.bt2
│   │   └── genome.rev.2.bt2
│   ├── bwa_mem
│   │   ├── ref_genome.fasta -> ../../resources/ref_genome_masked_final.fasta
│   │   ├── ref_genome.fasta.amb
│   │   ├── ref_genome.fasta.ann
│   │   ├── ref_genome.fasta.bwt
│   │   ├── ref_genome.fasta.fai
│   │   ├── ref_genome.fasta.pac
│   │   └── ref_genome.fasta.sa
│   ├── bwa_mem2
│   │   ├── ref_genome.fasta -> ../../resources/ref_genome_masked_final.fasta
│   │   ├── ref_genome.fasta.0123
│   │   ├── ref_genome.fasta.amb
│   │   ├── ref_genome.fasta.ann
│   │   ├── ref_genome.fasta.bwt.2bit.64
│   │   ├── ref_genome.fasta.fai
│   │   └── ref_genome.fasta.pac
│   ├── hisat2
│   │   ├── genome.1.ht2
│   │   ├── genome.2.ht2
│   │   ├── genome.3.ht2
│   │   ├── genome.4.ht2
│   │   ├── genome.5.ht2
│   │   ├── genome.6.ht2
│   │   ├── genome.7.ht2
│   │   ├── genome.8.ht2
│   │   ├── genome.exon
│   │   ├── genome.haplotype
│   │   ├── genome.snp
│   │   └── genome.ss
│   ├── kallisto
│   │   ├── ref_cdna.fa
│   │   ├── ref_transcript.idx
│   │   └── ref_transcript_to_gene.tsv
│   ├── R
│   │   ├── ref_annot_txdb.sqlite
│   │   ├── ref_cds.Rds
│   │   ├── ref_genome.2bit
│   │   ├── ref_transcript_ranges.Rds
│   │   └── ref_transcripts.Rds
│   ├── salmon
│   │   ├── decoys.txt
│   │   ├── gentrome.fasta
│   │   ├── transcriptome_index
│   │   │   ├── complete_ref_lens.bin
│   │   │   └── ...
│   │   └── requant_index
│   │       └── transcripts.fa
│   ├── snpeff
│   │   ├── data
│   │   │   └── GRCh38.<release>
│   │   │       ├── genes.gtf -> ../../../../resources/ref_annot.gtf
│   │   │       ├── sequences.fa -> ../../../../resources/ref_genome_masked_final.fasta
│   │   │       └── snpEffectPredictor.bin
│   │   └── snpeff.config
│   └── star
│       ├── chrLength.txt
│       ├── chrNameLength.txt
│       ├── chrName.txt
│       ├── chrStart.txt
│       ├── exonGeTrInfo.tab
│       ├── exonInfo.tab
│       ├── geneInfo.tab
│       ├── Genome
│       ├── genomeParameters.txt
│       ├── Log.out
│       ├── SA
│       ├── SAindex
│       ├── sjdbInfo.txt
│       ├── sjdbList.fromGTF.out.tab
│       ├── sjdbList.out.tab
│       └── transcriptInfo.tab
└── resources
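After a run, the layout above can be compared against the output directory. The sketch below lists the top-level index directories from the tree that are absent; the function name missing_indices is illustrative. A missing hisat2 directory is expected for mouse, since no hisat2 index is provided for that organism.

```shell
# Print the name of every expected index directory that is absent
# under <dir>/indices. Directory names are taken from the tree above.
missing_indices() {
    local outdir="$1" tool
    for tool in bowtie2 bwa_mem bwa_mem2 hisat2 kallisto R salmon snpeff star; do
        [ -d "$outdir/indices/$tool" ] || echo "$tool"
    done
}
```

An empty output of missing_indices /path/to/output/directory means that all nine directories are present; it says nothing about whether the files inside them are complete.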
bowtie2
Contains the index for bowtie2 (Langmead & Salzberg, 2012). In human mode, the index is based on the masked reference genome (see Gencode reference files).
hisat2
Contains the index for hisat2 (Kim et al., 2019). In human mode, the index is based on the masked reference genome (see Gencode reference files). The index is generated with the respective dbSNP VCF file (see Germline Variants). Due to the large number of SNPs available for mouse, we currently do not provide a hisat2 index for this organism (see #174 for details).
bwa_mem / bwa_mem2
The bwa_mem (Li & Durbin, 2009) and bwa_mem2 (Vasimuddin et al., 2019) directories contain the respective indices and a symlink to the reference genome fasta file. If the genome is masked, as is the case for human, the symlink points to the masked reference genome fasta (see Gencode reference files).
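Because the symlinks are relative (../../resources/...), they stop resolving if the indices directory is copied or moved without the resources directory next to it. The following sketch lists symlinks below indices whose target is missing; the function name dangling_index_links is illustrative.

```shell
# Print every symlink below <dir>/indices whose target does not exist.
# "test -e" follows the link, so it fails for dangling symlinks.
dangling_index_links() {
    find "$1/indices" -type l ! -exec test -e {} \; -print
}
```

An empty output of dangling_index_links /path/to/output/directory indicates that every symlink, including ref_genome.fasta in bwa_mem and bwa_mem2, still points to an existing file.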
snpEff
This directory contains the resources required to run the snpEff (Cingolani et al., 2012) predictor.
How to use the created resources to run snpEff?
The file snpeff.config has to be passed to snpEff with the command line option
-c when running snpEff. Additionally, the database name matching the subfolder
under indices/snpeff/data (e.g. GRCh38.49) must be passed as a positional
argument, and -nodownload must be set so snpEff does not try to fetch the
database from the internet.
Example usage
snpEff \
-stats <path_to_stats_outfile> \
-csvStats <path_to_stats_outcsvfile> \
-c <path_to_generated_snpeff_config> \
-nodownload \
<genome_build>.<release> \
<path_to_vcf_file>
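The database name does not have to be typed by hand, since it matches the subfolder under indices/snpeff/data. The sketch below reads it from the directory, assuming there is exactly one database subfolder; the function name snpeff_db_name is illustrative.

```shell
# Print the snpEff database name, i.e. the name of the first subdirectory
# of <snpeff_dir>/data. Assumption: only one database is present.
snpeff_db_name() {
    local snpeff_dir="$1" entry
    for entry in "$snpeff_dir"/data/*/; do
        [ -d "$entry" ] || continue
        basename "$entry"
        return 0
    done
    return 1
}
```

The result can be passed as the positional argument, for example snpEff -c indices/snpeff/snpeff.config -nodownload "$(snpeff_db_name indices/snpeff)" <path_to_vcf_file>.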
R
This directory contains GenomicFeatures representations of annotation data for use with splice2neo (Lang et al., 2024).
STAR
The STAR (Dobin et al., 2013) directory contains the STAR index. The path to
this directory has to be passed as the --genomeDir parameter when running STAR
mapping.
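A mapping call that uses the index could look like the sketch below. It assumes paired-end FASTQ input and leaves all other STAR options at their defaults; the wrapper name star_align and the argument order are illustrative.

```shell
# Map paired-end reads against the generated STAR index.
# Arguments: index directory, read 1, read 2, output prefix.
star_align() {
    local genome_dir="$1" r1="$2" r2="$3" prefix="$4"
    STAR --genomeDir "$genome_dir" \
         --readFilesIn "$r1" "$r2" \
         --outFileNamePrefix "$prefix"
}
```

For the layout shown above, the first argument would be </path/to/output/directory>/indices/star.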
Salmon
Contains the Salmon
(Patro et al., 2017) index. The transcriptome_index subdirectory holds the
main gentrome-based index for quantification. The requant_index/transcripts.fa
file is a transcript-only FASTA derived from the annotation and reference genome
via gffread, intended for re-quantification workflows that require an
annotation-consistent transcript sequence.
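For quantification, the transcriptome_index directory is the one to pass to salmon. The sketch below assumes paired-end reads and automatic library type detection (-l A); the wrapper name salmon_quant and the argument order are illustrative.

```shell
# Quantify paired-end reads against the gentrome-based Salmon index.
# Arguments: index directory, read 1, read 2, output directory.
salmon_quant() {
    local index_dir="$1" r1="$2" r2="$3" out_dir="$4"
    salmon quant -i "$index_dir" -l A -1 "$r1" -2 "$r2" -o "$out_dir"
}
```

For the layout shown above, the first argument would be </path/to/output/directory>/indices/salmon/transcriptome_index.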
Kallisto
Contains the Kallisto (Bray et al., 2016) index.
References
- Cingolani, P., Platts, A., Wang, L. L., Coon, M., Nguyen, T., Wang, L., Land, S. J., Lu, X., & Ruden, D. M. (2012). A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly, 6(2), 80–92. https://doi.org/10.4161/fly.19695
- Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., & Gingeras, T. R. (2013). STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15–21. https://doi.org/10.1093/bioinformatics/bts635
- Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology, 37(8), 907–915. https://doi.org/10.1038/s41587-019-0201-4
- Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359. https://doi.org/10.1038/nmeth.1923
- Lang, F., Sorn, P., Suchan, M., Henrich, A., Albrecht, C., Köhl, N., Beicht, A., Riesgo-Ferreiro, P., Holtsträter, C., Schrörs, B., Weber, D., Löwer, M., Sahin, U., & Ibn-Salem, J. (2024). Prediction of tumor-specific splicing from somatic mutations as a source of neoantigen candidates. Bioinformatics Advances, 4(1), vbae080. https://doi.org/10.1093/bioadv/vbae080
- Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760. https://doi.org/10.1093/bioinformatics/btp324
- Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods, 14(4), 417–419. https://doi.org/10.1038/nmeth.4197
- Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology, 34(5), 525–527. https://doi.org/10.1038/nbt.3519
- Vasimuddin, Md., Misra, S., Li, H., & Aluru, S. (2019). Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems. 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 314–324. https://doi.org/10.1109/IPDPS.2019.00041