How to format scoring files from the PGS Catalog
pgscatalog-format is a CLI application that makes it easy to combine scoring files into a standardised output.
Note
pgscatalog-combine was recently renamed to pgscatalog-format.
The process involves:
extracting important fields from scoring files
doing some quality control checks
optionally lifting over variants to a consistent genome build
writing to a consistent schema
Input scoring files must follow PGS Catalog standards. The output files are useful for data science tasks, such as matching variants between a scoring file and a target genome.
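To illustrate what "extracting important fields" can mean in practice, here is a minimal sketch that reads one metadata field from the header of a gzipped scoring file. It assumes the `#key=value` header layout used by PGS Catalog scoring files (for example `#genome_build=...`). The helper name `header_field` is hypothetical and is not part of pgscatalog-format.

```shell
# Hypothetical helper: print the value of one "#key=value" header line
# from a gzipped scoring file. Assumes the key contains no regex characters.
header_field() {
    gzip -dc "$1" | sed -n "s/^#$2=//p" | head -n 1
}

# Only query the file if it has actually been downloaded.
scorefile=PGS000001_hmPOS_GRCh38.txt.gz
if [ -f "$scorefile" ]; then
    header_field "$scorefile" genome_build
fi
```

A check like this can help confirm which genome build a custom scoring file declares before it is passed to pgscatalog-format.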
Installation
$ pipx install pgscatalog-core
Usage
Combining PGS Catalog scoring files
Tip
It’s easiest to get started by downloading scoring files in the same genome build (see How to download scoring files from the PGS Catalog).
$ mkdir output
$ pgscatalog-format -s PGS000001_hmPOS_GRCh38.txt.gz PGS0001229_hmPOS_GRCh38.txt.gz -t GRCh38 -o output
For each input scoring file, a formatted scoring file is written to the output directory.
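Before running the command above, it can be worth confirming that every harmonised input is in the build you intend to target. The sketch below infers the build from the file name alone, assuming the `<id>_hmPOS_<build>.txt.gz` naming pattern seen in the example files. The helper `build_from_name` is hypothetical, and the check does not inspect file contents.

```shell
# Hypothetical helper: print the genome build embedded in a harmonised
# scoring file name such as PGS000001_hmPOS_GRCh38.txt.gz
build_from_name() {
    name=${1##*/}                 # drop any leading directory
    name=${name%.txt.gz}          # drop the extension
    printf '%s\n' "${name##*_}"   # keep the text after the last underscore
}

target=GRCh38
for f in PGS000001_hmPOS_GRCh38.txt.gz PGS0001229_hmPOS_GRCh38.txt.gz; do
    build=$(build_from_name "$f")
    if [ "$build" != "$target" ]; then
        echo "warning: $f looks like $build, not $target" >&2
    fi
done
```

Custom scoring files that do not follow this naming pattern need a different check, such as reading the file header.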
Tip
If you’re formatting many scoring files, the --threads parameter can speed up the process by formatting them in parallel.
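As a rough sketch of how --threads might be used, the snippet below sets the thread count to the number of available CPUs. It assumes that --threads accepts an integer. The `run` helper and the `DRY_RUN` variable are hypothetical conveniences that print the command by default, so nothing is executed until you set `DRY_RUN=0`.

```shell
# Use one worker per available CPU, falling back to 1 if the count is unknown.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Assumption: --threads takes an integer number of workers.
run pgscatalog-format \
    -s PGS000001_hmPOS_GRCh38.txt.gz PGS0001229_hmPOS_GRCh38.txt.gz \
    -t GRCh38 \
    -o output \
    --threads "$threads"
```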
Lifting over scoring files
It’s possible to combine scoring files with different genome builds using liftover.
Danger
Only use liftover when combining PGS Catalog scoring files with custom scoring files. The PGS Catalog already provides harmonised scoring files, so liftover is unnecessary when every input comes from the Catalog.
First, download chain files from UCSC and copy them into a directory (e.g. my_chain_dir/).
Suppose you have a custom scoring file in GRCh37 (my_scorefile_grch37.txt.gz) and you want to combine it with a PGS Catalog scoring file in GRCh38:
$ mkdir output
$ pgscatalog-format -s PGS000001_hmPOS_GRCh38.txt.gz my_scorefile_grch37.txt.gz \
--chain_dir my_chain_dir/ \
-t GRCh38 \
-o output
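If liftover fails, a missing chain file is one possible cause. The sketch below counts which chain files are absent from a directory. The two file names are the standard UCSC liftOver chains between hg19 (GRCh37) and hg38 (GRCh38). That pgscatalog-format expects exactly these names is an assumption here, and `count_missing_chains` is a hypothetical helper.

```shell
# Hypothetical helper: report chain files missing from a directory and
# print how many are missing. File names are assumed, not confirmed.
count_missing_chains() {
    missing=0
    for chain in hg19ToHg38.over.chain.gz hg38ToHg19.over.chain.gz; do
        if [ ! -f "$1/$chain" ]; then
            echo "missing chain file: $1/$chain" >&2
            missing=$((missing + 1))
        fi
    done
    echo "$missing"
}

count_missing_chains my_chain_dir
```

A printed count of 0 means both assumed chain files are present in the directory.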
Help
$ pgscatalog-format --help