Using k-mers to call HLA alleles in RNA sequencing data

View project on GitHub

What is it?

HLAProfiler uses the k-mer content of next generation sequencing reads to call HLA types in a sample. Based on the k-mer content each each read pair is assigned to an HLA gene and the aggregate k-mer profile for the gene is compared to reference k-mer profiles to determin the HLA type. Currently HLAProfiler only supports paired-end RNA-seq data.

How does it work?

HLAProfiler utilizes the k-mer content of RNA-sequencing reads in two different ways in order to assign HLA types to a sample. First, it utilizes a modifed version of the Kraken taxonomic classifier along with a custom HLA taxonomy to assign reads to a HLA gene. Reads which cannot be assigned exclusively to a single gene are excluded from analysis. Second, HLAProfiler counts the observed k-mers in the gene specific reads and compares the k-mer profile to a database of expected reference k-mer profiles. Using this profile comparison, HLAProfiler identifies the HLA allele pair for each gene with the most supporting evidence in the observed sequence reads. by simulating paired-end reads from each HLA reference allele without errors, filtering these reads based on k-mer content, and tallying k-mer counts of the reads filtering to the expected genes.

Using HLAProfiler requires two steps, 1) database creation and 2) HLA calling.

Database Creation

For each allele, HLAProfiler simulates paired-end reads without errors, assigns each read to an HLA gene and calculates the k-mer profile using reads from the expected gene. As there are thoussand of alleles in the IPD/IMGT reference database, this step is very time intensive. For convenience, a ready-to-use, downloadable database is available from one of three sources:

1) GitHub HLAProfiler database only release. To access the full database, download this GitHub release (including the binaries) and make sure the associated binary files (database.idx and database.kdb) are located in the database directory

2) Q2 Lab Solutions Genomics Bioinformatics website (Request data at bottom of the page)

3) Request to the authors by email

To create a new database, an HLA reference fasta, transcriptome-wide transcript fa and gtf, an exclusion bed, and a hla CWD allele file are required. The transcript files and exclusion bed are used to create the distractome, which helps control for homology between HLA genes and other transcripts. The exclusion bed denotes genomic regions to exclude from the distractome. Any reads assigned to the distractome will be excluded from analysis. While we recommend the IPD/IMGT HLA database and GENCODE as the source of these references, any files can be used as long as they adhere to the following naming and format conventions.

Required Naming Conventions

HLA Reference fasta:

The sequence identifier must follow the format” >IDAlleleName i.e.

HLA00001	A*01:01:01:01
Transcript fasta:

Fields of the sequence identifier must be separated by ‘|’ and one of the fields must match the gene_name tag of the GTF annotation field. i.e.

Transcript GTF:

Standard GTF format. The annotation column must contain the gene_name tag.

HLA Exclusion Bed:

Standard BED format

HLA CWD allele file

This file should list the ID from the reference fasta of alleles that should be considered common and well documented. Under some circumstances where the observed and reference profiles of the top allele do not closely match and the top two allele pairs have similar scores, the final prediction will be weighted in favor of the common and well documented alleles.

Example build commands

Once the proper reference files are downloaded or created the database can be built using the following command:

perl ~/opt/scripts/HLAProfiler/bin/HLAProfiler.pl build -t path/to/transcript.fasta -g /path/to/transcript_annotations.gtf -e /path/to/hla_exclusion.bed -r /path/to/hla.reference.fasta -cwd /path/to/hla_cwd_alleles.txt -o /path/to/output_directory/ -db database_name -kp /path/to/kraken/ -c number_of_threads 

HLA alleles can be processed in parallel and given the thousands of alleles that need processing we recommend specifying multiple threads using the -c option.

For more details on usage run:

perl HLAProfiler.pl build -h

HLA calling:

HLAProfiler can be used for HLA calling by running the following example command:

perl HLAProfiler.pl predict -fastq1 sequencedata_1.fastq.gz -fastq2 sequencedata_2.fastq.gz -database_name database -database_dir /path/to/HLAProfiler -reference /path/to/HLAProfiler/database/data/reference/hla.ref.merged.fa -threads 8 -output_dir /path/to/output/dir -allele_refinement all -kraken_path /path/to/kraken/ -if -l sample.HLAProfiler.log

For more details on usage run:

perl HLAProfiler.pl predict -h


Accurate HLA calling in RNA-seq data, extends the utility and increases the information of this data. HLAProfiler fills the gap of existing tools by improving accuracy for some HLA genes, identifying rare alleles, and correctly identifying novel alleles and alleles with incomplete reference data.


HLAProfiler can be installed on Linux/OSX from the bioconda repository, or using the GitHub release directly. For the smoothest installation, we strongly recommend the bionconda installation.

For the latest and most comprehensive instruction on installing bioconda visit [https://bioconda.github.io/]. For convenience the steps have been summarized here.

Install miniconda

Download the latest version of Miniconda (Python 3 recommended) and run the installer. For more details on miniconda installation visit [https://conda.io/docs/install/quick.html]. After installation be sure to restart the terminal session for your changes to take effect.

bash Miniconda3-latest-MacOSX-x86_64.sh 
bash Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86.sh

Set up channels

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

Install HLAProfiler

conda install hlaprofiler

Location of executables

To run HLAProfiler you will need to know the location of the HLAProfiler script as well as the classify executable from Kraken. The HLAProfiler script should be located in the miniconda bin directory (/path/to/minconda/bin/). The classify executable will be found in the kraken-ea directory under the share directory (/path/to/miniconda/share/kraken-ea-version/).

Github installation

For automatic installation, download and unzip the HLAProfiler GitHub release tarball or clone the git repository into the directory where you want the package to be installed (“/path/to/HLAProfiler”)

cd /path/to/HLAProfiler
perl bin/install.pl –d /path/to/install_dir/ -m wget –b /bin/directory/in/PATH/

The installer will retrieve the dependencies and place them in /bin/directory/in/PATH

Location of executables

If using the automated installer kraken (/path/to/kraken) will be located in bin/kraken-version/ under the HLAProfiler directory.


HLAProfiler has modules that allow for testing that the software is working correctly

Test module installation

This test_modules module will test to see all require HLAProfiler modules are accessible by the main script

HLAProfiler.pl test_modules

Unit tests

The test module will start the unit test. Test files will be installed in /path/to/miniconda/usr/local/share/hlaprofiler_tests/ for bioconda installations or /path/to/HLAProfiler/test/

HLAProfiler.pl test –t all –kp /path/to/kraken -td /path/to/test/directory/ -od /path/to/output/directory

Test can also be run on individual modules. To see the list of available modules execute:

HLAProfiler.pl test -h


Martin Buchkovich (martin.buchkovich@q2labsolutions.com)


Use of this software is limited to non-commercial use under Q Square Solutions Expression Analysis user license. See the LICENSE file for more details.