How to match variants across reference panels and target genomes
pgscatalog-intersect is a CLI application that matches variants (on the same strand) between reference and target variant files in .pvar or .bim format. The application:
Uses allele frequency (afreq) and variant missingness (vmiss) data to evaluate whether each variant in the target is suitable for inclusion in a PCA
Filters matches on strand ambiguity and multi-allelic/INDEL status
Outputs a tab-delimited report describing each intersected variant
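To make the filtering terms above concrete, here is a minimal Python sketch of how strand ambiguity and indel status are commonly defined. It is an illustration only, not the code pgscatalog-intersect uses: the function names are hypothetical, and the indel check is a simplification that flags any allele longer than one base.

```python
# Illustrative helpers only; these are NOT part of pgscatalog-intersect.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a0: str, a1: str) -> bool:
    """True when two single-base alleles are complements (A/T or C/G).

    For such pairs the forward and reverse strands look identical,
    so the strand cannot be inferred from the alleles alone.
    """
    a0, a1 = a0.upper(), a1.upper()
    return len(a0) == 1 and len(a1) == 1 and COMPLEMENT.get(a0) == a1


def is_indel(a0: str, a1: str) -> bool:
    """Simplified check: treat any allele longer than one base as an indel."""
    return len(a0) != 1 or len(a1) != 1
```

An A/T or C/G variant is flagged as strand ambiguous, while an A/G variant is not.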
Installation
$ pip install pgscatalog-match
Usage
Matching HGDP vs 1000 Genomes
To intersect variants in HGDP against 1000 Genomes:
$ pgscatalog-intersect --ref GRCh38_1000G_ALL.pvar.zst \
--target GRCh38_hgdp_5.pvar.zst \
--chrom 5 \
--maf_target 0.1 \
--geno_miss 0.1 \
--outdir . \
-v
You’ll also need gzipped afreq and vmiss files for the target genome in the same directory.
The output is a tab-delimited text file with one variant per row and the following columns:
| Column name | Description |
|---|---|
| CHR:POS:A0:A1 | A colon delimited variant ID, consisting of chromosome, position, a0 and a1 alleles (like REF / ALT) |
| ID_REF | The ID of the reference variant |
| REF_REF | The REF allele of the reference variant |
| IS_INDEL | Is the reference variant an indel? |
| STRANDAMB | Is the reference variant strand ambiguous? |
| IS_MA_REF | Is the reference variant multi-allelic? |
| ID_TARGET | The ID of the matched variant in the target genome |
| REF_TARGET | The REF allele of the target variant |
| IS_MA_TARGET | Is the target variant multi-allelic? |
| AAF | Allele frequency |
| F_MISS_DOSAGE | Missing dosage rate |
| SAME_REF | Do the reference and target variant have the same reference allele? |
| PCA_ELIGIBLE | Is the variant eligible for PCA inclusion? |
You can extract variant IDs from this table and pass them to a tool such as plink2 to extract a subset of variants.
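As a rough sketch of that workflow, the Python below reads a report and collects the target IDs of PCA-eligible variants. Several details are assumptions rather than documented behaviour: the report is taken to be uncompressed, the PCA_ELIGIBLE column is assumed to hold text such as True/False (or 1/0), and the function names and file paths are hypothetical.

```python
import csv


def pca_eligible_ids(report_path, id_column="ID_TARGET"):
    """Return IDs from rows whose PCA_ELIGIBLE value looks truthy.

    Assumption: PCA_ELIGIBLE is encoded as text like "True"/"False" or "1"/"0".
    """
    truthy = {"true", "1", "t", "yes"}
    with open(report_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            row[id_column]
            for row in reader
            if row["PCA_ELIGIBLE"].strip().lower() in truthy
        ]


def write_id_list(ids, out_path):
    """Write one variant ID per line, a format plink2 --extract can read."""
    with open(out_path, "w") as out:
        for variant_id in ids:
            out.write(f"{variant_id}\n")
```

The resulting ID list can then be given to plink2 with its --extract option, for example `plink2 --pfile target --extract ids.txt --make-pgen --out subset`, where `target`, `ids.txt` and `subset` are placeholder names.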
Help
$ pgscatalog-intersect --help