How to match variants across reference panels and target genomes

pgscatalog-intersect is a CLI application that makes it easy to match variants (same strand) between a set of reference and target data .pvar/bim files. The application:

  • Uses allelic frequency (afreq) and variant missingness (vmiss) to evaluate whether the variants in the TARGET are suitable for inclusion in a PCA analysis

  • Filters matches on strand ambiguity and multi-allelic/INDEL status

  • Outputs a tab delimited report describing each intersected variant

Installation

$ pip install pgscatalog-match

Usage

Matching HGDP vs 1000 Genomes

To intersect variants in HGDP against 1000 Genomes:

$ pgscatalog-intersect --ref GRCh38_1000G_ALL.pvar.zst \
    --target GRCh38_hgdp_5.pvar.zst \
    --chrom 5 \
    --maf_target 0.1 \
    --geno_miss 0.1 \
    --outdir . \
    -v

You’ll also need gzipped afreq and vmiss files for the target genome in the same directory.

The output is a tab delimited text file structured to contain one variant per row, with the following columns:

Column name

Description

CHR:POS:A0:A1

A colon delimited variant ID, consisting of chromosome, position, a0 and a1 alleles (like REF / ALT)

ID_REF

The ID of the reference variant

REF_REF

The REF allele of the reference variant

IS_INDEL

Is the reference variant an indel?

STRANDAMB

Is the reference variant strand ambiguous?

IS_MA_REF

Is the reference variant multi-allelic?

ID_TARGET

The ID of the matched variant in the target genome

REF_TARGET

The REF allele of the target variant

IS_MA_TARGET

Is the target variant multi-allelic?

AAF

Allele frequency

F_MISS_DOSAGE

Missing dosage rate

SAME_REF

Do the reference and target variant have the same reference allele?

PCA_ELIGIBLE

Is the variant eligible for PCA inclusion?

You can use the table to extract IDs and then use something like plink2 to extract a subset of variants.

Help

$ pgscatalog-intersect --help