Download Resources
The "Download Resources" subworkflow downloads resources required for common bioinformatics analyses from the providers listed below. Each provider has its own license and citation requirements. Please review them and acknowledge the original sources when using an OBLX-generated library.
Note: OBLX downloads Twist Exome BED files from UCSC. These files are not released under an open-source license, so please check whether they may be used for your use case.
| Provider | Download URL | Citation |
|---|---|---|
| GENCODE (https://www.gencodegenes.org/) | https://ftp.ebi.ac.uk/pub/databases/gencode | Mudge et al. (2025) |
| UCSC (https://genome.ucsc.edu/) | https://hgdownload.soe.ucsc.edu | Casper et al. (2026) |
| GATK / Broad resource bundle (https://gatk.broadinstitute.org/) | https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0 | Van der Auwera et al. (2020) |
| gnomAD (https://gnomad.broadinstitute.org/) | https://storage.googleapis.com/gcp-public-data--gnomad/release | Chen et al. (2024), Karczewski et al. (2020) |
| UniProt (https://www.uniprot.org/) | https://rest.uniprot.org/uniprotkb/stream | The UniProt Consortium (2025) |
| NCBI (https://www.ncbi.nlm.nih.gov/snp/) | https://ftp.ncbi.nih.gov/ | Phan et al. (2025) |
| Ensembl (https://www.ensembl.org) | https://ftp.ensembl.org/pub/ | Keane et al. (2011) |
| Genbank (https://www.ncbi.nlm.nih.gov/genbank/) | download via efetch with Genbank accessions specified in https://github.com/TRON-Bioinformatics/oblx/blob/dev/workflow/resources/tcga_viruses.tsv | Clark et al. (2015) |
Input
No input is required. However, the organism and releases of individual resources can be specified in the config file.
Usage
To run the pull resources subworkflow, use the following command:
snakemake --until pull_resources \
--directory </path/to/output/directory> \
--software-deployment-method [conda|apptainer] \
--latency-wait 60 \
[--configfile <path/to/config/file>] \
[--profile </path/to/cluster/profile/>]
- --directory: Directory to store the results of the workflow.
- --software-deployment-method: Either conda or apptainer. Container images for apptainer are configured in config/container_config.yaml.
- --latency-wait: Number of seconds (e.g. 60) to wait for files to be created, to account for IO latency.
- --configfile (optional): Defines e.g. the reference genome version that should be used, see Configuration.
- --profile (optional): Cluster profile used to submit jobs, e.g. to an HPC.
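When the subworkflow is launched from a script rather than typed by hand, the command can be assembled programmatically. The following is a minimal sketch, not part of OBLX: the function name build_pull_resources_command and its parameter names are hypothetical, and only the command-line flags are taken from the command above.

```python
import shlex
from typing import List, Optional


def build_pull_resources_command(
    output_dir: str,
    deployment_method: str = "conda",
    latency_wait: int = 60,
    configfile: Optional[str] = None,
    profile: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the pull_resources target (hypothetical helper)."""
    if deployment_method not in ("conda", "apptainer"):
        raise ValueError("deployment_method must be 'conda' or 'apptainer'")
    cmd = [
        "snakemake",
        "--until", "pull_resources",
        "--directory", output_dir,
        "--software-deployment-method", deployment_method,
        "--latency-wait", str(latency_wait),
    ]
    # Optional flags are appended only when a value is given.
    if configfile is not None:
        cmd += ["--configfile", configfile]
    if profile is not None:
        cmd += ["--profile", profile]
    return cmd


# Show the shell-quoted command; pass the list to subprocess.run to execute it.
print(shlex.join(build_pull_resources_command("/path/to/output", "apptainer")))
```

Returning a list instead of a single string avoids shell-quoting problems when paths contain spaces.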
Output
The pull resources step gathers all files that are required for index generation or that are directly used by downstream tools. An overview of all downloaded and generated resources is given in the overview tables below, separately for human and mouse mode.
The workflow generates the following directory structure (in human mode):
</path/to/output/dir>/resources
├── exome_definition
│ ├── ref_cds.bed
│ ├── ref_exome.bed
│ ├── ref_exome.bed.gz
│ ├── ref_exome.bed.gz.tbi
│ ├── twist_comprehensive_exome.bed
│ ├── twist_core_exome.bed
│ ├── twist_exome2.bed
│ └── twist_refseq.bed
├── gatk_bundle
│ ├── 1000G_omni2.5.hg38.vcf.gz
│ ├── 1000G_omni2.5.hg38.vcf.gz.tbi
│ ├── 1000G_phase1.snps.high_confidence.hg38.vcf.gz
│ ├── 1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi
│ ├── hapmap_3.3.hg38.vcf.gz
│ ├── hapmap_3.3.hg38.vcf.gz.tbi
│ ├── Homo_sapiens_assembly38.dbsnp138.vcf.gz
│ ├── Homo_sapiens_assembly38.dbsnp138.vcf.gz.tbi
│ ├── Homo_sapiens_assembly38.known_indels.vcf.gz
│ ├── Homo_sapiens_assembly38.known_indels.vcf.gz.tbi
│ ├── Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
│ └── Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
├── germline_variants
│ ├── gnomAD
│ │ └── {exomes,genomes}
│ │   ├── af_only_gnomad_hg38.vcf.gz
│ │   ├── af_only_gnomad_hg38.vcf.gz.tbi
│ │   ├── common_biallelic_chr1.vcf.gz
│ │   └── common_biallelic_chr1.vcf.gz.tbi
│ ├── dbSNP_151.vcf.gz
│ └── dbSNP_151.vcf.gz.tbi
├── mappability
│ ├── encode_exclusion.bed
│ ├── grcExclusions.bed
│ └── ucsc_problematic.bed
├── chromosome_sizes.txt
├── ref_annot.bed
├── ref_annot.gtf
├── ref_annot_gene2symbol.tsv
├── ref_annot_splice_sites.tsv
├── ref_annot_transcript2gene.tsv
├── ref_annot_metadata_SwissProt.tsv
├── ref_annot_metadata_TrEMBL.tsv
├── ref_genome_primary.fasta
├── ref_genome_grc_masked.fasta
├── ref_genome_masked_final.fasta
├── ref_genome.fasta
├── ref_genome.fasta.fai
├── ref_genome.dict
├── ref_transcripts.fasta
├── ref_genome_repeatmasker.bed
├── uniprot
│ └── uniprot_annotations.tsv
└── viruses
  └── tcga_virus_decoy.fasta
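After a run, it can be useful to confirm that the expected files are present. The sketch below checks a small subset of the human-mode paths from the tree above. The helper name missing_resources and the choice of which files to check are assumptions for illustration and not part of the workflow.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List

# Subset of the human-mode files listed in the directory tree (selection is illustrative).
EXPECTED_HUMAN_FILES = [
    "chromosome_sizes.txt",
    "ref_annot.gtf",
    "ref_genome.fasta",
    "ref_genome.fasta.fai",
    "ref_genome.dict",
    "exome_definition/ref_exome.bed",
    "gatk_bundle/hapmap_3.3.hg38.vcf.gz",
    "germline_variants/dbSNP_151.vcf.gz",
    "uniprot/uniprot_annotations.tsv",
    "viruses/tcga_virus_decoy.fasta",
]


def missing_resources(output_dir: str, expected: Iterable[str] = EXPECTED_HUMAN_FILES) -> List[str]:
    """Return the expected paths (relative to <output_dir>/resources) that do not exist."""
    resources = Path(output_dir) / "resources"
    # Path.exists() follows symlinks, so a dangling ref_genome.fasta link counts as missing.
    return [rel for rel in expected if not (resources / rel).exists()]


# Demonstration on a temporary directory that contains only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "resources").mkdir()
    (Path(tmp) / "resources" / "chromosome_sizes.txt").write_text("chr1\t10\n")
    print(missing_resources(tmp, ["chromosome_sizes.txt", "ref_annot.gtf"]))
```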
Gencode reference files
The following files are downloaded directly from Gencode (Mudge et al., 2025). In human mode, problematic regions defined by GRC are hard masked in the reference fasta while repetitive regions are not masked.
- chromosome_sizes.txt: Lengths of the chromosomes.
- ref_annot.gtf: Comprehensive gene annotation based on the primary assembly (PRI) (gencode.v<release>.primary_assembly.annotation.gtf.gz).
- ref_annot.bed: BED12 file of the transcripts (transformed from the GTF file).
- ref_genome.fasta: Symlink to the primary assembly reference genome fasta. When pull_resources is run in human mode, the symlink points to the masked genome. Masking is based on resources/mappability/grcExclusions.bed, which is downloaded from UCSC (see section Mappability) and contains a set of regions that have been flagged by the GRC to contain false duplications or contamination sequences (Behera et al., 2022). Additionally in human mode, pseudoautosomal regions (defined in workflow/resources/GRCh38_pseudoautosomal_regions.bed from Ensembl) are hard masked. If pull_resources is run in mouse mode, the symlink points to the primary assembly (ref_genome_primary.fasta).
- ref_genome_grc_masked.fasta (only given in human mode): Based on the primary assembly, problematic regions defined by GRC are hard masked (e.g. false duplications and contaminations, see GIAB readme).
- ref_genome_masked_final.fasta (only given in human mode): Based on the ref_genome_grc_masked.fasta file, pseudoautosomal regions (defined in workflow/resources/GRCh38_pseudoautosomal_regions.bed from Ensembl) are masked.
- ref_annot_metadata_SwissProt.tsv: UniProtKB/SwissProt entry associated with the transcript (from Ensembl xref pipeline, gencode.v<release>.metadata.SwissProt.gz).
- ref_annot_metadata_TrEMBL.tsv: UniProtKB/TrEMBL entry associated with the transcript (from Ensembl xref pipeline, gencode.v<release>.metadata.TrEMBL.gz).
- ref_genome_primary.fasta: Primary (PRI) assembly (GRC<build>.primary_assembly.genome.fa.gz).
- ref_transcripts.fasta: Transcript sequences (gencode.v<release>.transcripts.fa.gz).
- ref_annot_transcript2gene.tsv: Translation of transcript ID to gene ID (transformed from the GTF file).
- ref_annot_gene2symbol.tsv: Translation of gene ID to gene symbol (gencode.v<release>.metadata.HGNC.gz).
- ref_annot_splice_sites.tsv: Splice sites of reference transcripts generated from the GTF (see splice2neo).
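Hard masking replaces every base inside the listed intervals with N while keeping the sequence length unchanged. The following sketch illustrates the two masking steps described above (GRC exclusions first, then pseudoautosomal regions) on a toy sequence. It is an illustration under assumptions: the function hard_mask and the example coordinates are hypothetical, and the workflow itself may use a dedicated tool for this step.

```python
from typing import Dict, Iterable, Tuple

# BED-style interval: (chromosome, start, end), 0-based and half-open.
Interval = Tuple[str, int, int]


def hard_mask(sequences: Dict[str, str], intervals: Iterable[Interval]) -> Dict[str, str]:
    """Return a copy of the sequences with all bases inside the intervals set to N."""
    masked = {name: list(seq) for name, seq in sequences.items()}
    for chrom, start, end in intervals:
        if chrom not in masked:
            continue  # interval on a sequence that is not part of this genome
        seq = masked[chrom]
        # Clip the interval to the sequence boundaries.
        start = max(start, 0)
        end = min(end, len(seq))
        if start < end:
            seq[start:end] = "N" * (end - start)
    return {name: "".join(chars) for name, chars in masked.items()}


# Toy example with made-up coordinates.
primary = {"chr1": "ACGTACGTAC"}
grc_masked = hard_mask(primary, [("chr1", 2, 5)])         # analogous to ref_genome_grc_masked.fasta
masked_final = hard_mask(grc_masked, [("chr1", 8, 20)])  # analogous to ref_genome_masked_final.fasta
print(masked_final["chr1"])  # prints ACNNNCGTNN
```

Because the sequence length is preserved, coordinates in annotation files such as ref_annot.gtf remain valid for the masked genome.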
UCSC repeatmasker regions
The RepeatMasker-annotated regions are downloaded from UCSC.
Note: We do not mask the repetitive regions in the ref_genome.fasta file.
- ref_genome_repeatmasker.bed: RepeatMasker regions from UCSC golden path translated into BED format.
Exome definition
The exome definition directory contains exonic and CDS (coding sequence) BED files. These files can be used, for example, to restrict variant callers such as Mutect2 to the specified regions. In human and mouse mode the following files can be found in this directory:
- ref_cds.bed: The CDS (coding sequence) intervals derived from the annotation GTF. CDS regions are merged. Based on the DeepVariant RNA-seq variant calling tutorial.
- ref_exome.bed: The exonic intervals extended by intron_slop (default: 20) bases defined in the config file. The file was generated by selecting the exons from the annotation GTF with the tag defined in exome_transcript_definition.
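The interval operations behind these files, extending each interval by a fixed number of bases and merging overlapping intervals, can be sketched in a few lines. This is an illustration only: the function slop_and_merge is hypothetical, the workflow may use a dedicated tool instead, and merging the extended exons is an assumption (the text above states merging only for the CDS regions).

```python
from typing import Dict, List, Tuple

# BED-style interval: (chromosome, start, end), 0-based and half-open.
Interval = Tuple[str, int, int]


def slop_and_merge(
    intervals: List[Interval], slop: int, chrom_sizes: Dict[str, int]
) -> List[Interval]:
    """Extend every interval by `slop` bases on both sides, then merge overlaps."""
    extended = [
        (chrom, max(0, start - slop), min(chrom_sizes[chrom], end + slop))
        for chrom, start, end in intervals
    ]
    extended.sort()
    merged: List[Interval] = []
    for chrom, start, end in extended:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlapping or directly adjacent: grow the previous interval.
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Toy exons with made-up coordinates; slop=20 mirrors the documented intron_slop default.
exons = [("chr1", 100, 200), ("chr1", 230, 300), ("chr1", 5, 50)]
print(slop_and_merge(exons, slop=20, chrom_sizes={"chr1": 1000}))
# prints [('chr1', 0, 70), ('chr1', 80, 320)]
```

Clipping at 0 and at the chromosome length (as given in chromosome_sizes.txt) keeps the extended intervals inside the reference sequence.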
In human mode, this directory additionally contains the Twist exome BED files:
- twist_refseq.bed: Exome capture kit Twist_Exome_RefSeq_targets_hg38.bb downloaded from UCSC exomeProbesets, transformed into a BED file using ucsc-bigbedtobed version 469.
- twist_core_exome.bed: Exome capture kit Twist_Exome_Target_hg38.bb downloaded from UCSC exomeProbesets, transformed into a BED file using ucsc-bigbedtobed version 469.
- twist_comprehensive_exome.bed: Exome capture kit Twist_ComprehensiveExome_targets_hg38.bb downloaded from UCSC exomeProbesets, transformed into a BED file using ucsc-bigbedtobed version 469.
- twist_exome2.bed: Exome capture kit TwistExome21.bb downloaded from UCSC exomeProbesets, transformed into a BED file using ucsc-bigbedtobed version 469.
GATK bundle
The GATK bundle is only available for human and thus only present in the output
when pull_resources is run in human mode. The URL to the GATK bundle for
download can be specified in the config file via the parameter gatk_url and is
by default the public Google cloud bucket:
https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0.
The following files are downloaded:
- 1000G_omni2.5.hg38.vcf.gz
- 1000G_phase1.snps.high_confidence.hg38.vcf.gz
- hapmap_3.3.hg38.vcf.gz
- Homo_sapiens_assembly38.dbsnp138.vcf.gz
- Homo_sapiens_assembly38.known_indels.vcf.gz
- Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
For all VCF files, the index is created using tabix v1.11.
Germline variants
Germline variants are downloaded from gnomAD and processed. The gnomAD files
are only available when pull_resources is run in human mode. The gnomAD version
has to be specified in the config file (see Configuration). The default is v4.1,
which is also the version used for testing. For every file, an index is created
with tabix v1.11. For gnomAD, the files af_only_gnomad_hg38.vcf.gz and
common_biallelic_chr1.vcf.gz are generated for both exomes and genomes and are
found in the respective subdirectory.
- af_only_gnomad_hg38.vcf.gz: gnomAD exome variants annotated only with population allele frequency for Mutect2
  - This file was created by downloading all chromosomes of gnomAD exome variants. Subsequently, only variants with a PASS filter status and a population allele frequency higher than minimum_allele_frequency (defined in the config, see Configuration, default 0.001) are kept. Finally, all annotations are removed and only the allele frequency is kept.
  - This file should be used as --germline-resource when running Mutect2.
- common_biallelic_chr1.vcf.gz: Common germline variant sites VCF for GetPileupSummaries
  - This file was created from exonic variants on chromosome 1 from gnomAD. These variants are filtered for AF > 0.05, --max-alleles 2 and PASS (these filters are described in the Mutect2 best practices workflow where "variants_for_contamination" is described).
- dbSNP_151.vcf.gz: dbSNP from NCBI FTP server. Currently only version 151 is supported for human.
  - This file is downloaded from NCBI FTP, and Ensembl chromosome names (without chr) are translated to Gencode chromosome names (with chr) using chromosome mappings.
  - In mouse mode, the dbSNP file is named dbSNP_mouse.vcf.gz and is downloaded from Ensembl FTP for the Ensembl version that matches the GENCODE version specified in the config. The chromosome names are adjusted to GENCODE convention via https://github.com/dpryan79/ChromosomeMappings.
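The filtering described for af_only_gnomad_hg38.vcf.gz can be illustrated on plain VCF text lines. The sketch below keeps records that have a PASS filter status and an allele frequency above a threshold, and reduces the INFO column to the AF entry. It is not the workflow's implementation: the function names are hypothetical, header lines are skipped for brevity, and for multi-allelic records only the first AF value is considered.

```python
from typing import Iterable, Iterator, Optional


def first_af(info: str) -> Optional[str]:
    """Return the first AF value of an INFO string as text, or None if absent."""
    for entry in info.split(";"):
        if entry.startswith("AF="):
            return entry[3:].split(",")[0]
    return None


def af_only_records(lines: Iterable[str], min_af: float = 0.001) -> Iterator[str]:
    """Yield PASS records with AF > min_af, keeping only AF in the INFO column."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # header lines are skipped in this sketch
        fields = line.split("\t")
        if len(fields) < 8 or fields[6] != "PASS":
            continue
        raw_af = first_af(fields[7])
        try:
            keep = raw_af is not None and float(raw_af) > min_af
        except ValueError:
            keep = False  # e.g. AF given as "."
        if keep:
            fields[7] = "AF=" + raw_af
            yield "\t".join(fields[:8])


# Made-up records: only the first one passes both filters.
example = [
    "##fileformat=VCFv4.2",
    "chr1\t100\t.\tA\tG\t.\tPASS\tAC=5;AF=0.25;AN=20",
    "chr1\t200\t.\tC\tT\t.\tPASS\tAF=0.0005",
    "chr1\t300\t.\tG\tA\t.\tAC0\tAF=0.5",
]
for record in af_only_records(example):
    print(record)
```

The default threshold of 0.001 mirrors the documented default of minimum_allele_frequency.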
Mappability
The mappability directory contains BED files with regions that are difficult to
map with short reads. The files are downloaded from UCSC only when
pull_resources is run in human mode. All files are transformed to BED format
using ucsc-bigbedtobed v469.
- encode_exclusion.bed: File encBlacklist.bb transformed to BED format.
- grcExclusions.bed: File grcExclusions.bb transformed to BED format.
- ucsc_problematic.bed: File comments.bb transformed to BED format.
UniProt
The file resources/uniprot/uniprot_annotations.tsv contains data fetched from
https://rest.uniprot.org/uniprotkb/stream for the fields specified in
workflow/scripts/programmatically_get_uniprot.py. The table contains the
column transcript_id which contains the GENCODE identifiers retrieved from the
TrEMBL and SwissProt mappings fetched from GENCODE
(resources/ref_annot_metadata_{TrEMBL,SwissProt}.tsv).
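A request to the stream endpoint is an HTTP GET with the search query, output format, and requested columns passed as URL parameters. The sketch below only builds such a URL. The query string and field names in the example are assumptions chosen for illustration; the fields actually requested by OBLX are defined in workflow/scripts/programmatically_get_uniprot.py and are not reproduced here.

```python
from typing import Sequence
from urllib.parse import urlencode

UNIPROT_STREAM_URL = "https://rest.uniprot.org/uniprotkb/stream"


def build_stream_url(query: str, fields: Sequence[str], fmt: str = "tsv") -> str:
    """Build a UniProtKB stream URL for the given query and return fields."""
    params = {
        "query": query,
        "format": fmt,
        "fields": ",".join(fields),
    }
    # urlencode percent-encodes reserved characters such as ':' and ','.
    return f"{UNIPROT_STREAM_URL}?{urlencode(params)}"


# Illustrative query and fields (assumptions, not the workflow's settings).
url = build_stream_url(
    "organism_id:9606 AND reviewed:true",
    ["accession", "gene_names", "length"],
)
print(url)
```

The resulting URL can be fetched with any HTTP client; the response is a tab-separated table with one column per requested field.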
Viruses
This directory contains genome sequences of common cancer-related viruses in
FASTA format, which can be used for contamination detection and profiling. The
list of viruses (workflow/resources/tcga_viruses.tsv) was downloaded from
TCGA.
The viral decoy sequences listed in this file can be appended to the reference
genome (resources/ref_genome.fasta) to enable investigation of reads mapping to
viral sequences. If this is needed, all tool-specific indices that depend on the
reference FASTA (e.g., aligner index) must be rebuilt from the updated file.
Viral sequences are only downloaded in human mode.
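Appending the decoys amounts to concatenating the two FASTA files while making sure that no sequence name occurs twice. The following is a minimal sketch under assumptions: the helper names and the output file name ref_genome_with_decoys.fasta are hypothetical, and the toy sequences are made up. Writing to a new file rather than to resources/ref_genome.fasta avoids overwriting the target of the symlink.

```python
import tempfile
from pathlib import Path
from typing import List


def fasta_names(path: Path) -> List[str]:
    """Return the sequence names (first word of each header line) of a FASTA file."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                parts = line[1:].split()
                if parts:
                    names.append(parts[0])
    return names


def append_decoys(reference: Path, decoys: Path, output: Path) -> None:
    """Write reference followed by decoys to output; fail on duplicate names."""
    clash = set(fasta_names(reference)) & set(fasta_names(decoys))
    if clash:
        raise ValueError(f"Duplicate sequence names: {sorted(clash)}")
    with open(output, "w") as out:
        for source in (reference, decoys):
            last_line = "\n"
            with open(source) as handle:
                for line in handle:
                    out.write(line)
                    last_line = line
            # Make sure the next header starts on its own line.
            if not last_line.endswith("\n"):
                out.write("\n")


# Toy demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "ref_genome.fasta"
    decoy = Path(tmp) / "tcga_virus_decoy.fasta"
    combined = Path(tmp) / "ref_genome_with_decoys.fasta"
    ref.write_text(">chr1\nACGT")               # no trailing newline on purpose
    decoy.write_text(">virus_example\nGGCC\n")  # hypothetical sequence name
    append_decoys(ref, decoy, combined)
    print(fasta_names(combined))  # prints ['chr1', 'virus_example']
```

After creating the combined file, the FASTA index, the sequence dictionary, and any aligner indices need to be regenerated for it.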
Overview of downloaded resources in human-mode
| Path in OBLX Library | Origin | Short description |
|---|---|---|
| resources/chromosome_sizes.txt | GENCODE | Chromosome sizes file for the resources/ref_genome.fasta file. |
| resources/exome_definition/ref_cds.bed | GENCODE | Reference CDS BED containing all CDS regions specified in the GTF. |
| resources/exome_definition/ref_exome.bed | GENCODE | Reference exome BED containing all exonic regions specified in the GTF with N positions padded left and right (defined with config parameter intron_slop). |
| resources/exome_definition/ref_exome.bed.gz | GENCODE | Bgzipped file of ref_exome.bed. |
| resources/exome_definition/ref_exome.bed.gz.tbi | GENCODE | Tabix index for ref_exome.bed.gz. |
| resources/exome_definition/twist_comprehensive_exome.bed | UCSC | Twist comprehensive exome definition BED. |
| resources/exome_definition/twist_core_exome.bed | UCSC | Twist core exome definition BED. |
| resources/exome_definition/twist_exome2.bed | UCSC | Twist Exome2 definition BED. |
| resources/exome_definition/twist_refseq.bed | UCSC | Twist RefSeq exome definition BED. |
| resources/gatk_bundle/1000G_omni2.5.hg38.vcf.gz | GATK | 1000 Genomes Omni 2.5 SNP resource VCF. |
| resources/gatk_bundle/1000G_omni2.5.hg38.vcf.gz.tbi | GATK | Tabix index for the 1000 Genomes Omni 2.5 VCF. |
| resources/gatk_bundle/1000G_phase1.snps.high_confidence.hg38.vcf.gz | GATK | 1000 Genomes high-confidence SNP resource VCF. |
| resources/gatk_bundle/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi | GATK | Tabix index for the 1000 Genomes high-confidence SNP VCF. |
| resources/gatk_bundle/hapmap_3.3.hg38.vcf.gz | GATK | HapMap 3.3 SNP resource VCF. |
| resources/gatk_bundle/hapmap_3.3.hg38.vcf.gz.tbi | GATK | Tabix index for the HapMap 3.3 VCF. |
| resources/gatk_bundle/Homo_sapiens_assembly38.dbsnp138.vcf.gz | GATK | GATK dbSNP 138 VCF. |
| resources/gatk_bundle/Homo_sapiens_assembly38.dbsnp138.vcf.gz.tbi | GATK | Tabix index for the GATK dbSNP 138 VCF. |
| resources/gatk_bundle/Homo_sapiens_assembly38.known_indels.vcf.gz | GATK | GATK known indels VCF. |
| resources/gatk_bundle/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi | GATK | Tabix index for the GATK known indels VCF. |
| resources/gatk_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz | GATK | GATK gold-standard indel resource VCF. |
| resources/gatk_bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi | GATK | Tabix index for the GATK gold-standard indel VCF. |
| resources/germline_variants/dbSNP_151.vcf.gz | dbSNP | dbSNP 151 germline variant VCF. |
| resources/germline_variants/dbSNP_151.vcf.gz.tbi | dbSNP | Tabix index for the dbSNP 151 VCF. |
| resources/germline_variants/gnomAD/exomes/af_only_gnomad_hg38.vcf.gz | gnomAD | gnomAD exomes allele-frequency-only VCF. |
| resources/germline_variants/gnomAD/exomes/af_only_gnomad_hg38.vcf.gz.tbi | gnomAD | Tabix index for the gnomAD exomes VCF. |
| resources/germline_variants/gnomAD/exomes/common_biallelic_chr1.vcf.gz | gnomAD | Common biallelic chr1 variants from gnomAD exomes. |
| resources/germline_variants/gnomAD/exomes/common_biallelic_chr1.vcf.gz.tbi | gnomAD | Tabix index for the gnomAD exomes chr1 VCF. |
| resources/germline_variants/gnomAD/genomes/af_only_gnomad_hg38.vcf.gz | gnomAD | gnomAD genomes allele-frequency-only VCF. |
| resources/germline_variants/gnomAD/genomes/af_only_gnomad_hg38.vcf.gz.tbi | gnomAD | Tabix index for the gnomAD genomes VCF. |
| resources/germline_variants/gnomAD/genomes/common_biallelic_chr1.vcf.gz | gnomAD | Common biallelic chr1 variants from gnomAD genomes. |
| resources/germline_variants/gnomAD/genomes/common_biallelic_chr1.vcf.gz.tbi | gnomAD | Tabix index for the gnomAD genomes chr1 VCF. |
| resources/mappability/encode_exclusion.bed | UCSC | ENCODE exclusion regions BED (downloaded from UCSC). |
| resources/mappability/grcExclusions.bed | UCSC | GRC exclusion regions BED (downloaded from UCSC) used to mask the reference genome in human mode. |
| resources/mappability/ucsc_problematic.bed | UCSC | UCSC problematic regions BED (downloaded from UCSC). |
| resources/ref_annot.bed | GENCODE | BED12 transcript annotation derived from the GTF. |
| resources/ref_annot.gtf | GENCODE | Comprehensive gene annotation GTF based on the primary assembly. |
| resources/ref_annot_gene2symbol.tsv | GENCODE | Gene-to-symbol mapping table. |
| resources/ref_annot_metadata_SwissProt.tsv | GENCODE | UniProtKB/SwissProt entry associated to the transcript (from Ensembl xref pipeline) |
| resources/ref_annot_metadata_TrEMBL.tsv | GENCODE | UniProtKB/TrEMBL entry associated to the transcript (from Ensembl xref pipeline) |
| resources/ref_annot_splice_sites.tsv | GENCODE | Splice sites of reference transcripts generated from the GTF (see https://github.com/TRON-Bioinformatics/splice2neo) |
| resources/ref_annot_transcript2gene.tsv | GENCODE | Transcript-to-gene mapping table. |
| resources/ref_genome.dict | GENCODE | Sequence dictionary for the reference genome (generated with GATK CreateSequenceDictionary). |
| resources/ref_genome.fasta | GENCODE | Symlink to the final reference genome FASTA (masked in human mode, primary assembly in mouse mode). |
| resources/ref_genome.fasta.fai | GENCODE | Index for the reference genome FASTA. |
| resources/ref_genome_grc_masked.fasta | GENCODE | Primary assembly with GRC-flagged problematic regions hard-masked (human only). |
| resources/ref_genome_masked_final.fasta | GENCODE | GRC-masked assembly with pseudoautosomal regions additionally hard-masked (human only). |
| resources/ref_genome_primary.fasta | GENCODE | Primary assembly genome FASTA downloaded from Gencode. |
| resources/ref_genome_repeatmasker.bed | UCSC | RepeatMasker-derived genome repeat regions BED downloaded from UCSC. |
| resources/ref_transcripts.fasta | GENCODE | Transcript sequences FASTA derived from Gencode resources. |
| resources/uniprot/uniprot_annotations.tsv | UniProt | Combined UniProt annotation table with current status (directly downloaded from UniProt's rest API) of SwissProt and TrEMBL. The transcript_id column maps to the transcript identifiers specified in the ref_annot.gtf file. |
| resources/viruses/tcga_virus_decoy.fasta | TCGA/GenBank | Viral decoy FASTA used for contamination/decoy-aware analyses. |
Overview of downloaded resources in mouse-mode
| Path in OBLX Library | Origin | Short description |
|---|---|---|
| resources/chromosome_sizes.txt | GENCODE | Chromosome sizes file for the resources/ref_genome.fasta file. |
| resources/exome_definition/ref_cds.bed | GENCODE | Reference CDS BED containing all CDS regions specified in the GTF. |
| resources/exome_definition/ref_exome.bed | GENCODE | Reference exome BED containing all exonic regions specified in the GTF with N positions padded left and right (defined with config parameter intron_slop). |
| resources/exome_definition/ref_exome.bed.gz | GENCODE | Bgzipped file of ref_exome.bed. |
| resources/exome_definition/ref_exome.bed.gz.tbi | GENCODE | Tabix index for ref_exome.bed.gz. |
| resources/germline_variants/dbSNP_mouse.vcf.gz | dbSNP | dbSNP germline variant VCF. |
| resources/germline_variants/dbSNP_mouse.vcf.gz.tbi | dbSNP | Tabix index for the dbSNP VCF. |
| resources/ref_annot.bed | GENCODE | BED12 transcript annotation derived from the GTF. |
| resources/ref_annot.gtf | GENCODE | Comprehensive gene annotation GTF based on the primary assembly. |
| resources/ref_annot_gene2symbol.tsv | GENCODE | Gene-to-symbol mapping table. |
| resources/ref_annot_metadata_SwissProt.tsv | GENCODE | UniProtKB/SwissProt entry associated to the transcript (from Ensembl xref pipeline) |
| resources/ref_annot_metadata_TrEMBL.tsv | GENCODE | UniProtKB/TrEMBL entry associated to the transcript (from Ensembl xref pipeline) |
| resources/ref_annot_splice_sites.tsv | GENCODE | Splice sites of reference transcripts generated from the GTF (see https://github.com/TRON-Bioinformatics/splice2neo) |
| resources/ref_annot_transcript2gene.tsv | GENCODE | Transcript-to-gene mapping table. |
| resources/ref_genome.dict | GENCODE | Sequence dictionary for the reference genome (generated with GATK CreateSequenceDictionary). |
| resources/ref_genome.fasta | GENCODE | Symlink to the primary assembly reference genome FASTA (in mouse mode). |
| resources/ref_genome.fasta.fai | GENCODE | Index for the reference genome FASTA. |
| resources/ref_genome_primary.fasta | GENCODE | Primary assembly genome FASTA downloaded from Gencode. |
| resources/ref_genome_repeatmasker.bed | UCSC | RepeatMasker-derived genome repeat regions BED downloaded from UCSC. |
| resources/ref_transcripts.fasta | GENCODE | Transcript sequences FASTA derived from Gencode resources. |
| resources/uniprot/uniprot_annotations.tsv | UniProt | Combined UniProt annotation table with current status (directly downloaded from UniProt's rest API) of SwissProt and TrEMBL. The transcript_id column maps to the transcript identifiers specified in the ref_annot.gtf file. |
References
- Behera, S., LeFaive, J., Orchard, P., Mahmoud, M., Paulin, L. F., Farek, J., Soto, D. C., Parker, S. C. J., Smith, A. V., Dennis, M. Y., Zook, J. M., & Sedlazeck, F. J. (2022). Fixing reference errors efficiently improves sequencing results. Genomics. https://doi.org/10.1101/2022.07.18.500506
- Casper, J., Speir, M. L., Raney, B. J., Perez, G., Nassar, L. R., Lee, C. M., Hinrichs, A. S., Gonzalez, J. N., Fischer, C., Diekhans, M., Clawson, H., Benet-Pages, A., Barber, G. P., Vaske, C. J., van Baren, M. J., Wang, K., Rodriguez, Y. J. P., Jenkins-Kiefer, J. A., Chalamala, M., … Haeussler, M. (2026). The UCSC Genome Browser database: 2026 update. Nucleic Acids Research, 54(D1), D1331–D1335. https://doi.org/10.1093/nar/gkaf1250
- Chen, S., Francioli, L. C., Goodrich, J. K., Collins, R. L., Kanai, M., Wang, Q., Alföldi, J., Watts, N. A., Vittal, C., Gauthier, L. D., Poterba, T., Wilson, M. W., Tarasova, Y., Phu, W., Grant, R., Yohannes, M. T., Koenig, Z., Farjoun, Y., Banks, E., … Karczewski, K. J. (2024). A genomic mutational constraint map using variation in 76,156 human genomes. Nature, 625(7993), 92–100. https://doi.org/10.1038/s41586-023-06045-0
- Karczewski, K. J., Francioli, L. C., Tiao, G., Cummings, B. B., Alföldi, J., Wang, Q., Collins, R. L., Laricchia, K. M., Ganna, A., Birnbaum, D. P., Gauthier, L. D., Brand, H., Solomonson, M., Watts, N. A., Rhodes, D., Singer-Berk, M., England, E. M., Seaby, E. G., Kosmicki, J. A., … MacArthur, D. G. (2020). The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 581(7809), 434–443. https://doi.org/10.1038/s41586-020-2308-7
- Mudge, J. M., Carbonell-Sala, S., Diekhans, M., Martinez, J. G., Hunt, T., Jungreis, I., Loveland, J. E., Arnan, C., Barnes, I., Bennett, R., Berry, A., Bignell, A., Cerdán-Vélez, D., Cochran, K., Cortés, L. T., Davidson, C., Donaldson, S., Dursun, C., Fatima, R., … Frankish, A. (2025). GENCODE 2025: Reference gene annotation for human and mouse. Nucleic Acids Research, 53(D1), D966–D975. https://doi.org/10.1093/nar/gkae1078
- Phan, L., Zhang, H., Wang, Q., Villamarin, R., Hefferon, T., Ramanathan, A., & Kattman, B. (2025). The evolution of dbSNP: 25 years of impact in genomic research. Nucleic Acids Research, 53(D1), D925–D931. https://doi.org/10.1093/nar/gkae977
- The UniProt Consortium, Bateman, A., Martin, M.-J., Orchard, S., Magrane, M., Adesina, A., Ahmad, S., Bowler-Barnett, E. H., Bye-A-Jee, H., Carpentier, D., Denny, P., Fan, J., Garmiri, P., Gonzales, L. J. D. C., Hussein, A., Ignatchenko, A., Insana, G., Ishtiaq, R., Joshi, V., … Zhang, J. (2025). UniProt: The Universal Protein Knowledgebase in 2025. Nucleic Acids Research, 53(D1), D609–D617. https://doi.org/10.1093/nar/gkae1010
- van der Auwera, G., & O’Connor, B. D. (2020). Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. O’Reilly Media, Incorporated. https://books.google.de/books?id=wwiCswEACAAJ