core.lib.scorefiles

This module contains classes to compose and contain a ScoringFile: a file in the PGS Catalog that contains a list of genetic variants and their effect weights. Scoring files are used to calculate PGS for new target genomes.
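
As a rough illustration of what the effect weights are for, a polygenic score is commonly computed as a weighted sum of effect-allele dosages. The sketch below assumes a hypothetical in-memory layout (dictionaries keyed by variant ID) with made-up IDs and numbers. It isn't how this module or any downstream calculator is implemented.

```python
# Hypothetical sketch: a PGS as a weighted sum of effect-allele dosages.
# Variant IDs, weights, and dosages are invented for illustration.

def polygenic_score(weights: dict[str, float], dosages: dict[str, int]) -> float:
    """Sum effect_weight * dosage over variants present in both inputs."""
    return sum(w * dosages[v] for v, w in weights.items() if v in dosages)

weights = {"variant_a": 0.5, "variant_b": -0.25, "variant_c": 1.0}
dosages = {"variant_a": 2, "variant_b": 1}  # variant_c was not genotyped

score = polygenic_score(weights, dosages)
print(score)
```

Variants missing from the target genome are simply skipped here; real pipelines usually track and report them.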

Attributes

logger

Classes

ScoringFile

Represents a single scoring file in the PGS Catalog.

ScoringFiles

This container class provides methods to work with multiple ScoringFile objects.

Module Contents

class core.lib.scorefiles.ScoringFile(identifier, target_build=None, query_result=None, **kwargs)

Represents a single scoring file in the PGS Catalog.

Parameters:
  • identifier – A PGS Catalog score accession in the format PGS123456 or a path to a local scoring file

  • target_build – An optional GenomeBuild, which represents the build you want the scoring file to align to

  • query_result – An optional ScoreQueryResult, if provided with an accession identifier it prevents hitting the PGS Catalog API

Raises:
  • pgscatalog.corelib.InvalidAccessionError – If the PGS Catalog API can’t find the provided accession

  • pgscatalog.corelib.ScoreFormatError – If you try to iterate over a ScoringFile without a local path (before downloading it)
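
Because identifier accepts either an accession or a path, it can help to picture how the two might be told apart. This is a minimal sketch with a hypothetical classify_identifier helper. The pattern ("PGS" followed by six digits) is an assumption inferred from the PGS123456 example, and the library's own rules may differ.

```python
import re
from pathlib import Path

# Assumed pattern, inferred from the "PGS123456" example: "PGS" + six digits.
PGS_ACCESSION = re.compile(r"PGS\d{6}")

def classify_identifier(identifier) -> str:
    """Hypothetical helper: guess whether an identifier is a path or an accession."""
    if isinstance(identifier, Path) or Path(str(identifier)).is_file():
        return "local path"
    if PGS_ACCESSION.fullmatch(str(identifier)):
        return "accession"
    return "invalid"

print(classify_identifier("PGS000001"))
print(classify_identifier("potato"))
print(classify_identifier(Path("custom.txt")))
```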

You can make a ScoringFile from a path to a scoring file with minimal metadata:

>>> from pgscatalog.core.lib.genomebuild import GenomeBuild
>>> from pgscatalog.core.lib._config import Config
>>> sf = ScoringFile(Config.ROOT_DIR / "tests" / "data" / "custom.txt")
>>> sf
ScoringFile('.../custom.txt', target_build=None)
>>> sf.header
ScoreHeader(pgs_id='test', pgs_name='test', trait_reported='test trait', genome_build=GenomeBuild.GRCh37)
>>> sf.is_harmonised
False

Scoring files from OmicsPred are also supported:

>>> from pgscatalog.core.lib.genomebuild import GenomeBuild
>>> from pgscatalog.core.lib._config import Config
>>> sf = ScoringFile(Config.ROOT_DIR / "tests" / "data" / "OPGS002493.txt.gz")
>>> sf
ScoringFile('.../OPGS002493.txt.gz', target_build=None)
>>> sf.header
ScoreHeader(pgs_id='OPGS002493', pgs_name='P80162', trait_reported='C-X-C motif chemokine 6', genome_build=GenomeBuild.GRCh37)
>>> sf.is_harmonised
False
>>> for variant in sf.variants:
...     variant
...     break
ScoreVariant(rsID='rs75288020', chr_name='4', chr_position=74151217, effect_allele=Allele(allele='A', is_snp=True)...

Also supports PGS Catalog header metadata:

>>> sf = ScoringFile(Config.ROOT_DIR / "tests" / "data" / "PGS000001_hmPOS_GRCh38.txt.gz")
>>> sf
ScoringFile('.../PGS000001_hmPOS_GRCh38.txt.gz', target_build=None)
>>> sf.header
CatalogScoreHeader(pgs_id='PGS000001', pgs_name='PRS77_BC', trait_reported='Breast cancer', genome_build=None, format_version=<ScoreFormatVersion.v2: '2.0'>, trait_mapped=['breast carcinoma'], trait_efo=['EFO_0000305'], variants_number=77, weight_type=None, pgp_id='PGP000001', citation='Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036', HmPOS_build=GenomeBuild.GRCh38, HmPOS_date=datetime.date(2022, 7, 29), HmPOS_match_pos='{"True": null, "False": null}', HmPOS_match_chr='{"True": null, "False": null}')

Looking at the header above, the original submission lacked a genome build but has been harmonised:

>>> sf.is_harmonised
True
>>> sf.genome_build
GenomeBuild.GRCh38
>>> sf.pgs_id
'PGS000001'
>>> for variant in sf.variants:
...     variant
...     break
ScoreVariant(rsID='rs78540526', chr_name='11', chr_position=None, effect_allele=Allele(allele='T', is_snp=True)...

You can also make a ScoringFile by using PGS Catalog score accessions:

>>> sf = ScoringFile("PGS000001", target_build=GenomeBuild.GRCh38)
>>> sf
ScoringFile('PGS000001', target_build=GenomeBuild.GRCh38)

It’s important to use the .download() method when you’re not working with local files, or many attributes and methods will be missing or won’t work:

>>> for variant in sf.variants:
...     variant
...     break
Traceback (most recent call last):
...
FileNotFoundError: self.local_path=None: did you remember to .download()?

A ScoringFile can also be constructed with a ScoreQueryResult if you want to be polite to the PGS Catalog API. Just add the query_result parameter:

>>> score_query_result = sf.catalog_response  # extract score query from old query
>>> ScoringFile(identifier=sf.pgs_id, query_result=sf.catalog_response)  # doesn't hit the PGS Catalog API again
ScoringFile('PGS000001', target_build=None)

InvalidAccessionError is raised if you provide bad identifiers:

>>> import tempfile
>>> with tempfile.TemporaryDirectory() as tmp_dir:
...     ScoringFile("potato", GenomeBuild.GRCh38).download(tmp_dir)
Traceback (most recent call last):
...
pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: Invalid accession: 'potato'

The same exception is raised if you provide a well formatted identifier that doesn’t exist:

>>> with tempfile.TemporaryDirectory() as tmp_dir:
...     ScoringFile("PGS000000", GenomeBuild.GRCh38).download(tmp_dir)
Traceback (most recent call last):
...
pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: No Catalog result for accession 'PGS000000'

download(directory, overwrite=False)

Download a ScoringFile to a specified directory with checksum validation

Parameters:
  • directory – Directory to write file to

  • overwrite – Overwrite existing file if present

Raises:
  • pgscatalog.corelib.ScoreDownloadError – If there’s an unrecoverable problem downloading the file

  • pgscatalog.corelib.ScoreChecksumError – If md5 validation consistently fails

Returns:

None
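
Checksum validation generally means hashing the downloaded bytes and comparing the digest with a published value. The sketch below shows that comparison with hashlib. The md5_matches helper is hypothetical, and how the library obtains the expected checksum or retries a failed download isn't shown here.

```python
import hashlib
import tempfile
from pathlib import Path

def md5_matches(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Hash the file in chunks and compare with the expected hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "example.txt"
    payload = b"rsID\teffect_weight\n"  # toy file contents
    target.write_bytes(payload)
    ok = md5_matches(target, hashlib.md5(payload).hexdigest())
    bad = md5_matches(target, "0" * 32)  # deliberately wrong checksum
    print(ok, bad)
```

Reading in chunks keeps memory use flat, which matters for large compressed scoring files.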

>>> import tempfile, os
>>> from pgscatalog.core.lib import GenomeBuild
>>> with tempfile.TemporaryDirectory() as tmp_dir:
...     ScoringFile("PGS000001").download(tmp_dir)
...     print(os.listdir(tmp_dir))
['PGS000001.txt.gz']

It’s possible to request a scoring file in a specific genome build:

>>> import tempfile, os
>>> with tempfile.TemporaryDirectory() as tmp_dir:
...     ScoringFile("PGS000001", GenomeBuild.GRCh38).download(tmp_dir)
...     print(os.listdir(tmp_dir))
['PGS000001_hmPOS_GRCh38.txt.gz']

normalise(liftover=False, drop_missing=False, chain_dir=None, target_build=None)

Extracts key fields from a scoring file in a normalised format.

Takes care of quality control.

>>> from pgscatalog.core.lib import GenomeBuild
>>> from pgscatalog.core.lib._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000001_hmPOS_GRCh38.txt.gz"
>>> variants = ScoringFile(testpath).normalise()
>>> for x in variants:
...     x
...     break
ScoreVariant(rsID='rs78540526', chr_name='11', chr_position=69516650, effect_allele=Allele(allele='T', is_snp=True), ...

Supports lifting over scoring files from GRCh37 to GRCh38:

>>> testpath = Config.ROOT_DIR / "tests" / "data" / "lift_to_grch38.txt"
>>> chaindir = Config.ROOT_DIR / "tests" / "data" / "chain"
>>> sf = ScoringFile(testpath)
>>> variants = sf.normalise(liftover=True, chain_dir=chaindir, target_build=GenomeBuild.GRCh38)
>>> for x in variants:
...     (x.rsID, x.chr_name, x.chr_position)
...     break
('rs78540526', '11', 69516650)

Example of lifting down (GRCh38 to GRCh37):

>>> testpath = Config.ROOT_DIR / "tests" / "data" / "lift_to_grch37.txt"
>>> chaindir = Config.ROOT_DIR / "tests" / "data" / "chain"
>>> sf = ScoringFile(testpath)
>>> variants = sf.normalise(liftover=True, chain_dir=chaindir, target_build=GenomeBuild.GRCh37)
>>> for x in variants:
...     (x.rsID, x.chr_name, x.chr_position)
...     break
('rs78540526', '11', 69331418)

Liftover support is only really useful for custom scoring files that aren’t in the PGS Catalog. It’s always best to use harmonised data when it’s available from the PGS Catalog. Harmonised data goes through a lot of validation and error checking.

A LiftoverError is only raised when many converted coordinates are missing.
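
To make the idea of "missing converted coordinates" concrete, here is a toy liftover that maps positions through offset blocks and fails only when too many positions are unmapped. Everything here is an assumption for illustration: the block boundaries are invented, the max_missing threshold is arbitrary, and ValueError stands in for LiftoverError. Real chain files contain many blocks and handle strand changes.

```python
# Toy liftover: each block maps a half-open source interval to a target
# offset. Block boundaries are invented; only the offset (185232) is derived
# from the example coordinates above (69331418 -> 69516650 on chromosome 11).
BLOCKS = {"11": [(69_000_000, 70_000_000, 185_232)]}

def lift(chrom: str, pos: int):
    """Return the lifted position, or None if no block covers it."""
    for start, end, offset in BLOCKS.get(chrom, []):
        if start <= pos < end:
            return pos + offset
    return None

def lift_all(variants, max_missing=0.5):
    """Lift (chrom, pos) pairs; fail only if too many are unmapped.

    max_missing is an arbitrary assumption, not the library's threshold.
    """
    lifted = [lift(c, p) for c, p in variants]
    missing = sum(x is None for x in lifted)
    if variants and missing / len(variants) > max_missing:
        raise ValueError(f"{missing}/{len(variants)} coordinates failed to lift")
    return lifted

result = lift_all([("11", 69_331_418), ("11", 1_000)])
print(result)
```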

Normalising converts the is_dominant and is_recessive optional fields in scoring files into an EffectType:

>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
>>> variants = ScoringFile(testpath).normalise()
>>> for i, x in enumerate(variants):
...     (x.is_dominant, x.is_recessive, x.effect_type)
...     if i == 2:
...         break
(True, False, EffectType.DOMINANT)
(False, True, EffectType.RECESSIVE)
(True, False, EffectType.DOMINANT)
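
The conversion can be pictured as collapsing two boolean flags into one enum member. This is a minimal sketch with a stand-in ToyEffectType enum. The ADDITIVE default and the rejection of rows with both flags set are assumptions, since only DOMINANT and RECESSIVE appear in the output above.

```python
from enum import Enum

class ToyEffectType(Enum):
    """Stand-in enum, not the library's EffectType."""
    ADDITIVE = "additive"    # assumed default when neither flag is set
    DOMINANT = "dominant"
    RECESSIVE = "recessive"

def effect_type(is_dominant: bool, is_recessive: bool) -> ToyEffectType:
    """Collapse two optional boolean flags into one enum member."""
    if is_dominant and is_recessive:
        raise ValueError("a variant can't be both dominant and recessive")
    if is_dominant:
        return ToyEffectType.DOMINANT
    if is_recessive:
        return ToyEffectType.RECESSIVE
    return ToyEffectType.ADDITIVE

print(effect_type(True, False))
print(effect_type(False, True))
```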

read() → collections.abc.Iterator[csv.DictReader]

A simple method of reading variants from a scoring file.

Returns a csv.DictReader, so each row is a variant in a dictionary.

No data validation is done. If you want validation, pass the returned dictionaries to the pydantic models (CatalogScoreVariants).
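
Every value from csv.DictReader is a string, so any typing has to happen afterwards. The sketch below uses a plain dataclass as a stand-in for a validated model (it isn't the library's pydantic class) and assumes tab-separated input. The row values are copied from the read() example in this section.

```python
import csv
import io
from dataclasses import dataclass

@dataclass(frozen=True)
class Row:
    """Stand-in for a validated model; the field selection is an assumption."""
    rsID: str
    chr_name: str
    chr_position: int
    effect_allele: str
    effect_weight: float

    @classmethod
    def from_dict(cls, d: dict[str, str]) -> "Row":
        # int() and float() raise ValueError on malformed values
        return cls(
            rsID=d["rsID"],
            chr_name=d["chr_name"],
            chr_position=int(d["chr_position"]),
            effect_allele=d["effect_allele"],
            effect_weight=float(d["effect_weight"]),
        )

# Tab-separated text shaped like the dictionary in the read() example
text = (
    "rsID\tchr_name\tchr_position\teffect_allele\teffect_weight\n"
    "rs10936599\t3\t170974795\tT\t0.123\n"
)
rows = [Row.from_dict(d) for d in csv.DictReader(io.StringIO(text), delimiter="\t")]
print(rows[0])
```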

This method must be called with a context manager:

>>> from pgscatalog.core.lib._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
>>> sf = ScoringFile(testpath)
>>> with sf.read() as reader:
...     for variant in reader:
...         variant
...         break
{'rsID': 'rs10936599', 'chr_name': '3', 'chr_position': '170974795', 'effect_allele': 'T', 'other_allele': 'C', 'effect_weight': '0.123', 'allelefrequency_effect': '0.377', 'is_dominant': 'True', 'is_recessive': 'False', 'locus_name': 'MYNN', 'hm_source': 'ENSEMBL', 'hm_rsID': 'rs10936599', 'hm_chr': '3', 'hm_pos': '169492101', 'hm_inferOtherAllele': ''}

Calling this method directly isn’t helpful:

>>> sf.read()
<contextlib._GeneratorContextManager object ...>

Only local scoring files can be read (download them first):

>>> sf = ScoringFile("PGS001229")
>>> with sf.read() as f:
...     pass
Traceback (most recent call last):
...
FileNotFoundError: self.local_path=None: did you remember to .download()?
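
The _GeneratorContextManager repr shown earlier for a bare sf.read() call is what functions decorated with contextlib.contextmanager return. The sketch below shows that general pattern with a hypothetical open_rows helper and a made-up two-line file. It isn't the library's implementation.

```python
import contextlib
import csv
import tempfile
from pathlib import Path

@contextlib.contextmanager
def open_rows(path):
    """Hypothetical helper: yield a DictReader and close the file on exit."""
    with open(path, newline="") as handle:
        yield csv.DictReader(handle, delimiter="\t")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy.txt"
    path.write_text("rsID\teffect_weight\nrs1\t0.5\n")  # invented contents
    manager = open_rows(path)            # nothing is opened yet
    manager_type = type(manager).__name__
    with manager as reader:              # file opens here, closes after the block
        first = next(reader)

print(manager_type)
print(first)
```

Tying the reader to a with block guarantees the underlying file handle is closed, even if iteration stops early.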

read_variants() → collections.abc.Generator[pgscatalog.core.lib.models.ScoreVariant, None, None]

Yields rows from a scoring file as ScoreVariants.

ScoreVariants are pydantic models that validate each row against PGS Catalog standards.

>>> from pgscatalog.core.lib._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
>>> sf = ScoringFile(testpath)
>>> variants = sf.read_variants()
>>> for i, variant in enumerate(variants):
...     variant
...     if i == 2:
...         break
ScoreVariant(rsID='rs10936599', chr_name='3', ...
ScoreVariant(rsID='rs6061231', chr_name='20', ...
ScoreVariant(rsID='rs10774214', chr_name='12', ...

property genome_build: pgscatalog.core.lib.GenomeBuild
property header
property is_harmonised: bool
property is_wide: bool
property pgs_id: str
property target_build: pgscatalog.core.lib.GenomeBuild

The GenomeBuild you want a ScoringFile to align to. Useful when using PGS Catalog accessions to instantiate this class.

property variants: collections.abc.Generator[pgscatalog.core.lib.models.ScoreVariant, None, None]

A generator that yields rows from the scoring file as ScoreVariants. It is only available when the instance has a valid local path (for accessions, that means after downloading).

>>> from pgscatalog.core.lib._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
>>> sf = ScoringFile(testpath)
>>> for variant in sf.variants:
...     variant
...     break
ScoreVariant(rsID='rs10936599', chr_name='3', chr_position=170974795...

class core.lib.scorefiles.ScoringFiles(*args, target_build=None, **kwargs)

This container class provides methods to work with multiple ScoringFile objects.

You can use publication or trait accessions to instantiate:

>>> from pgscatalog.core.lib import GenomeBuild
>>> ScoringFiles("PGP000001", target_build=GenomeBuild.GRCh37)
ScoringFiles('PGS000001', 'PGS000002', 'PGS000003', target_build=GenomeBuild.GRCh37)

Or multiple PGS IDs:

>>> ScoringFiles("PGS000001", "PGS000002")
ScoringFiles('PGS000001', 'PGS000002', target_build=None)

List input is OK too:

>>> ScoringFiles(["PGS000001", "PGS000002"])
ScoringFiles('PGS000001', 'PGS000002', target_build=None)

Or any mixture of publications, traits, and scores:

>>> ScoringFiles("PGP000001", "PGS000001", "PGS000002")
ScoringFiles('PGS000001', 'PGS000002', 'PGS000003', target_build=None)

Scoring files with duplicate PGS IDs (accessions) are automatically dropped. In the example above PGP000001 contains PGS000001, PGS000002, and PGS000003.
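
One common way to drop duplicates while keeping first-seen order is dict.fromkeys. This is only a sketch of the idea with a hypothetical unique_in_order helper. Whether the library preserves input order or sorts the result is an assumption not confirmed here.

```python
def unique_in_order(accessions):
    """Drop repeated accessions, keeping the first occurrence of each."""
    return list(dict.fromkeys(accessions))

# PGP000001 expands to three scores (per the example above); two are repeated
expanded = ["PGS000001", "PGS000002", "PGS000003", "PGS000001", "PGS000002"]
deduplicated = unique_in_order(expanded)
print(deduplicated)
```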

Traits can have children. To include these traits, use the include_children parameter:

>>> score_with_children = ScoringFiles("MONDO_0004975", include_children=True)
>>> score_wo_children = ScoringFiles("MONDO_0004975", include_children=False)
>>> len(score_with_children) > len(score_wo_children)
True

For example, Alzheimer’s disease (MONDO_0004975) includes Late-onset Alzheimer’s disease (EFO_1001870) as a child trait.
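
Including child traits amounts to walking a trait hierarchy and collecting scores along the way. The toy sketch below uses invented data structures: only the MONDO_0004975 to EFO_1001870 link comes from the text above, and the score labels are placeholders rather than real accessions. The PGS Catalog API resolves this server-side, so treat it as a mental model only.

```python
# Toy ontology: only the MONDO_0004975 -> EFO_1001870 link comes from the
# text above; the score labels are placeholders, not real accessions.
CHILDREN = {"MONDO_0004975": ["EFO_1001870"], "EFO_1001870": []}
SCORES = {"MONDO_0004975": ["score_a"], "EFO_1001870": ["score_b"]}

def scores_for_trait(trait: str, include_children: bool = False) -> list[str]:
    """Collect score labels for a trait, optionally walking child traits."""
    found = list(SCORES.get(trait, []))
    if include_children:
        for child in CHILDREN.get(trait, []):
            found.extend(scores_for_trait(child, include_children=True))
    return found

with_children = scores_for_trait("MONDO_0004975", include_children=True)
without_children = scores_for_trait("MONDO_0004975")
print(len(with_children) > len(without_children))
```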

Concatenation works as expected:

>>> ScoringFiles('PGS000001') + ScoringFiles('PGS000002', 'PGS000003')
ScoringFiles('PGS000001', 'PGS000002', 'PGS000003', target_build=None)

But only ScoringFiles with the same genome build can be concatenated:

>>> ScoringFiles('PGS000001') + ScoringFiles('PGS000002', 'PGS000003', target_build=GenomeBuild.GRCh38)
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for +: 'ScoringFiles' and 'ScoringFiles'

Multiplication isn’t supported, because ScoringFile elements must be unique:

>>> ScoringFiles('PGS000001') * 3
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for *: 'ScoringFiles' and 'int'
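
Both error messages above are Python's generic operator errors. One way to get that behaviour is to return NotImplemented from __add__ when the builds differ, and to leave __mul__ undefined. The Bag class below is a toy stand-in, not the library's ScoringFiles class, and the build check is an assumption about how such a rule could be written.

```python
class Bag:
    """Toy container; not the library's ScoringFiles class."""

    def __init__(self, *items, build=None):
        self.items = list(dict.fromkeys(items))  # keep elements unique
        self.build = build

    def __add__(self, other):
        # Returning NotImplemented makes Python raise the generic TypeError
        if not isinstance(other, Bag) or other.build != self.build:
            return NotImplemented
        return Bag(*self.items, *other.items, build=self.build)

    def __repr__(self):
        return f"Bag({', '.join(map(repr, self.items))}, build={self.build!r})"

print(Bag("a") + Bag("b", "c"))
try:
    Bag("a") + Bag("b", build="GRCh38")
except TypeError as err:
    print(err)
```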

You can index and iterate over ScoringFiles, and test membership:

>>> score = ScoringFiles("PGP000001", target_build=GenomeBuild.GRCh38)
>>> score[0]
ScoringFile('PGS000001', target_build=GenomeBuild.GRCh38)
>>> for x in score:
...     x
ScoringFile('PGS000001', target_build=GenomeBuild.GRCh38)
ScoringFile('PGS000002', target_build=GenomeBuild.GRCh38)
ScoringFile('PGS000003', target_build=GenomeBuild.GRCh38)
>>> score[0] in score
True

The accession validation rules from ScoringFile also apply:

>>> ScoringFiles("PGPpotato")
Traceback (most recent call last):
...
pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: No Catalog result for accession 'PGPpotato'

Local files can also be used to instantiate ScoringFiles:

>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     x = ScoringFile("PGS000001", target_build=GenomeBuild.GRCh38)
...     x.download(directory=d)
...     ScoringFiles(x.local_path)
ScoringFiles('.../PGS000001_hmPOS_GRCh38.txt.gz', target_build=None)

But the target_build parameter doesn’t work with local files:

>>> with tempfile.TemporaryDirectory() as d:
...     x = ScoringFile("PGS000002", target_build=GenomeBuild.GRCh38)
...     x.download(directory=d)
...     ScoringFiles(x.local_path, target_build=GenomeBuild.GRCh37)
Traceback (most recent call last):
...
ValueError: Can't load local scoring file when target_build is setTry .normalise() method to do liftover, or load harmonised scoring files from PGS Catalog

If you have a local scoring file that needs to change genome build, and using PGS Catalog harmonised data isn’t an option, you should make a ScoringFile from a path, then use the normalise() method with liftover enabled.

property elements

Returns a list of ScoringFile objects contained inside ScoringFiles

target_build = None
core.lib.scorefiles.logger