How to combine scoring files from the PGS Catalog¶
pgscatalog-combine
is a CLI application that makes it easy to combine scoring files into a standardised output.
The process involves:
extracting important fields from scoring files
doing some quality control checks
optionally lifting over variants to a consistent genome build
writing a long format / melted output file
Input scoring files must follow PGS Catalog standards. The output file is useful for doing data science tasks, like matching variants across a scoring file and target genome.
Installation¶
$ pip install pgscatalog-core
Usage¶
Combining PGS Catalog scoring files¶
Tip
It’s easiest to get started by downloading scoring files in the same genome build: How to download scoring files from the PGS Catalog
$ pgscatalog-combine -s PGS000001_hmPOS_GRCh38.txt.gz PGS0001229_hmPOS_GRCh38.txt.gz -t GRCh38 -o combined.txt
Note
If you’re combining lots of files, you can compress the output automatically --o combined.txt.gz
Lifting over scoring files¶
It’s possible to combine scoring files with different genome builds using liftover.
Danger
You should only do this when combining PGS Catalog and custom scoring files, because the PGS Catalog provides harmonised data
First, download chain files from UCSC:
And copy them into a directory (e.g. my_chain_dir/
).
Assuming you have a custom scoring file in GRCh37 (my_scorefile_grch37.txt.gz
), and you want to combine it with a PGS Catalog scoring file in GRCh38.
$ pgscatalog-combine -s PGS000001_hmPOS_GRCh38.txt.gz my_scorefile_grch37.txt.gz \
--chain_dir my_chain_dir/ \
-t GRCh38 \
-o combined.txt
Help¶
$ pgscatalog-combine --help