core.lib.scorefiles¶
This module contains classes to compose and contain a ScoringFile: a file
in the PGS Catalog that contains a list of genetic variants and their effect weights.
Scoring files are used to calculate PGS for new target genomes.
Attributes¶
logger |
Classes¶
ScoringFile | Represents a single scoring file in the PGS Catalog. |
ScoringFiles | This container class provides methods to work with multiple ScoringFile objects. |
Module Contents¶
- class core.lib.scorefiles.ScoringFile(identifier, target_build=None, query_result=None, **kwargs)¶
Represents a single scoring file in the PGS Catalog.
- Parameters:
identifier – A PGS Catalog score accession in the format PGS123456, or a path to a local scoring file
target_build – An optional GenomeBuild, which represents the build you want the scoring file to align to
query_result – An optional ScoreQueryResult; if provided with an accession identifier, it prevents hitting the PGS Catalog API
- Raises:
pgscatalog.corelib.InvalidAccessionError – If the PGS Catalog API can’t find the provided accession
pgscatalog.corelib.ScoreFormatError – If you try to iterate over a ScoringFile without a local path (before downloading it)
You can make ScoringFile objects from a path to a scoring file with minimal metadata:
>>> from pgscatalog.core.lib.genomebuild import GenomeBuild
>>> from pgscatalog.core.lib._config import Config
>>> sf = ScoringFile(Config.ROOT_DIR / "tests" / "data" / "custom.txt")
>>> sf
ScoringFile('.../custom.txt', target_build=None)
>>> sf.header
ScoreHeader(pgs_id='test', pgs_name='test', trait_reported='test trait', genome_build=GenomeBuild.GRCh37)
>>> sf.is_harmonised
False
Scoring files from OmicsPred are also supported:
>>> from pgscatalog.core.lib.genomebuild import GenomeBuild
>>> from pgscatalog.core.lib._config import Config
>>> sf = ScoringFile(Config.ROOT_DIR / "tests" / "data" / "OPGS002493.txt.gz")
>>> sf
ScoringFile('.../OPGS002493.txt.gz', target_build=None)
>>> sf.header
ScoreHeader(pgs_id='OPGS002493', pgs_name='P80162', trait_reported='C-X-C motif chemokine 6', genome_build=GenomeBuild.GRCh37)
>>> sf.is_harmonised
False
>>> for variant in sf.variants:
...     variant
...     break
ScoreVariant(rsID='rs75288020', chr_name='4', chr_position=74151217, effect_allele=Allele(allele='A', is_snp=True)...
Also supports PGS Catalog header metadata:
>>> sf = ScoringFile(Config.ROOT_DIR / "tests" / "data" / "PGS000001_hmPOS_GRCh38.txt.gz")
>>> sf
ScoringFile('.../PGS000001_hmPOS_GRCh38.txt.gz', target_build=None)
>>> sf.header
CatalogScoreHeader(pgs_id='PGS000001', pgs_name='PRS77_BC', trait_reported='Breast cancer', genome_build=None, format_version=<ScoreFormatVersion.v2: '2.0'>, trait_mapped=['breast carcinoma'], trait_efo=['EFO_0000305'], variants_number=77, weight_type=None, pgp_id='PGP000001', citation='Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036', HmPOS_build=GenomeBuild.GRCh38, HmPOS_date=datetime.date(2022, 7, 29), HmPOS_match_pos='{"True": null, "False": null}', HmPOS_match_chr='{"True": null, "False": null}')
Looking at the header above, the original submission lacked a genome build but has been harmonised:
>>> sf.is_harmonised
True
>>> sf.genome_build
GenomeBuild.GRCh38
>>> sf.pgs_id
'PGS000001'
>>> for variant in sf.variants:
...     variant
...     break
ScoreVariant(rsID='rs78540526', chr_name='11', chr_position=None, effect_allele=Allele(allele='T', is_snp=True)...
You can also make a ScoringFile by using PGS Catalog score accessions:
>>> sf = ScoringFile("PGS000001", target_build=GenomeBuild.GRCh38)
>>> sf
ScoringFile('PGS000001', target_build=GenomeBuild.GRCh38)
It’s important to use the .download() method when you’re not working with local files, or many attributes and methods will be missing or won’t work:
>>> for variant in sf.variants:
...     variant
...     break
Traceback (most recent call last):
...
FileNotFoundError: self.local_path=None: did you remember to .download()?
A ScoringFile can also be constructed with a ScoreQueryResult if you want to be polite to the PGS Catalog API. Just add the query_result parameter:
>>> score_query_result = sf.catalog_response  # extract score query from old query
>>> ScoringFile(identifier=sf.pgs_id, query_result=score_query_result)  # doesn't hit the PGS Catalog API again
ScoringFile('PGS000001', target_build=None)
InvalidAccessionError is raised if you provide bad identifiers:
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as tmp_dir:
...     ScoringFile("potato", GenomeBuild.GRCh38).download(tmp_dir)
Traceback (most recent call last):
...
pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: Invalid accession: 'potato'
The same exception is raised if you provide a well formatted identifier that doesn’t exist:
>>> with tempfile.TemporaryDirectory() as tmp_dir:
...     ScoringFile("PGS000000", GenomeBuild.GRCh38).download(tmp_dir)
Traceback (most recent call last):
...
pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: No Catalog result for accession 'PGS000000'
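The identifier rules described for this class (an accession such as PGS123456, or a path to a local scoring file) can be illustrated with a small standalone sketch. This is not the library's implementation: the helper name classify_identifier and the regular expression (the letters PGS followed by six digits) are assumptions made for illustration. Unlike the real class, the sketch never queries the PGS Catalog API, so it can't tell whether a well-formed accession such as PGS000000 actually exists.

```python
import re
from pathlib import Path

# Assumption: a score accession is "PGS" followed by exactly six digits,
# matching the PGS123456 format given in the parameter description.
ACCESSION_PATTERN = re.compile(r"PGS\d{6}")


def classify_identifier(identifier):
    """Return "accession", "path", or "invalid" (hypothetical helper)."""
    text = str(identifier)
    if ACCESSION_PATTERN.fullmatch(text):
        return "accession"
    if Path(text).is_file():
        return "path"
    return "invalid"


print(classify_identifier("PGS000001"))           # a well-formed accession
print(classify_identifier("no/such/dir/potato"))  # neither an accession nor a file
```

A check like this only filters out malformed input early. Confirming that an accession exists still needs a Catalog lookup, which is why the two InvalidAccessionError messages above differ.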
- download(directory, overwrite=False)¶
Download a ScoringFile to a specified directory with checksum validation
- Parameters:
directory – Directory to write file to
overwrite – Overwrite existing file if present
- Raises:
pgscatalog.corelib.ScoreDownloadError – If there’s an unrecoverable problem downloading the file
pgscatalog.corelib.ScoreChecksumError – If md5 validation consistently fails
- Returns:
None
>>> import tempfile, os
>>> from pgscatalog.core.lib import GenomeBuild
>>> with tempfile.TemporaryDirectory() as tmp_dir:
...     ScoringFile("PGS000001").download(tmp_dir)
...     print(os.listdir(tmp_dir))
['PGS000001.txt.gz']
It’s possible to request a scoring file in a specific genome build:
>>> import tempfile, os
>>> with tempfile.TemporaryDirectory() as tmp_dir:
...     ScoringFile("PGS000001", GenomeBuild.GRCh38).download(tmp_dir)
...     print(os.listdir(tmp_dir))
['PGS000001_hmPOS_GRCh38.txt.gz']
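The md5 validation mentioned for download() can be pictured with a standard-library sketch. The helper md5_matches is hypothetical, and the sketch assumes the expected checksum is already available as a hexadecimal string. Where the library obtains that checksum, and how it retries, is not described here.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_matches(path, expected_md5, chunk_size=65536):
    """Return True if the file's md5 hex digest equals expected_md5 (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large scoring files are never held in memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()


with tempfile.TemporaryDirectory() as tmp_dir:
    scorefile = Path(tmp_dir) / "example.txt"
    scorefile.write_bytes(b"rsID\teffect_allele\teffect_weight\n")
    expected = hashlib.md5(scorefile.read_bytes()).hexdigest()
    print(md5_matches(scorefile, expected))  # intact file
    scorefile.write_bytes(b"truncated")
    print(md5_matches(scorefile, expected))  # content changed after the checksum was taken
```

A downloader built on a check like this could retry a failed comparison a few times before giving up, which would fit the "consistently fails" wording of ScoreChecksumError.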
- normalise(liftover=False, drop_missing=False, chain_dir=None, target_build=None)¶
Extracts key fields from a scoring file in a normalised format.
Takes care of quality control.
>>> from pgscatalog.core.lib import GenomeBuild
>>> from pgscatalog.core.lib._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000001_hmPOS_GRCh38.txt.gz"
>>> variants = ScoringFile(testpath).normalise()
>>> for x in variants:
...     x
...     break
ScoreVariant(rsID='rs78540526', chr_name='11', chr_position=69516650, effect_allele=Allele(allele='T', is_snp=True), ...
Supports lifting over scoring files from GRCh37 to GRCh38:
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "lift_to_grch38.txt"
>>> chaindir = Config.ROOT_DIR / "tests" / "data" / "chain"
>>> sf = ScoringFile(testpath)
>>> variants = sf.normalise(liftover=True, chain_dir=chaindir, target_build=GenomeBuild.GRCh38)
>>> for x in variants:
...     (x.rsID, x.chr_name, x.chr_position)
...     break
('rs78540526', '11', 69516650)
Example of lifting down (GRCh38 to GRCh37):
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "lift_to_grch37.txt"
>>> chaindir = Config.ROOT_DIR / "tests" / "data" / "chain"
>>> sf = ScoringFile(testpath)
>>> variants = sf.normalise(liftover=True, chain_dir=chaindir, target_build=GenomeBuild.GRCh37)
>>> for x in variants:
...     (x.rsID, x.chr_name, x.chr_position)
...     break
('rs78540526', '11', 69331418)
Liftover support is only really useful for custom scoring files that aren’t in the PGS Catalog. It’s always best to use harmonised data when it’s available from the PGS Catalog. Harmonised data goes through a lot of validation and error checking.
A LiftoverError is only raised when many converted coordinates are missing.
Normalising converts the is_dominant and is_recessive optional fields in scoring files into an EffectType:
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
>>> variants = ScoringFile(testpath).normalise()
>>> for i, x in enumerate(variants):
...     (x.is_dominant, x.is_recessive, x.effect_type)
...     if i == 2:
...         break
(True, False, EffectType.DOMINANT)
(False, True, EffectType.RECESSIVE)
(True, False, EffectType.DOMINANT)
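The mapping from the two optional boolean fields to a single effect type can be sketched as follows. SketchEffectType and effect_type_from_flags are hypothetical names, not the library's EffectType. Treating a variant with neither flag as additive, and rejecting a variant with both flags set, are assumptions of this sketch rather than documented behaviour.

```python
from enum import Enum


class SketchEffectType(Enum):
    """Hypothetical stand-in for the library's EffectType."""

    ADDITIVE = "additive"
    DOMINANT = "dominant"
    RECESSIVE = "recessive"


def effect_type_from_flags(is_dominant=False, is_recessive=False):
    """Collapse two boolean flags into one enum member (hypothetical helper)."""
    if is_dominant and is_recessive:
        # Assumption: the two flags are mutually exclusive
        raise ValueError("a variant can't be both dominant and recessive")
    if is_dominant:
        return SketchEffectType.DOMINANT
    if is_recessive:
        return SketchEffectType.RECESSIVE
    # Assumption: variants with neither flag set are treated as additive
    return SketchEffectType.ADDITIVE


print(effect_type_from_flags(is_dominant=True, is_recessive=False))
print(effect_type_from_flags(is_dominant=False, is_recessive=True))
```

A single enum makes downstream code branch on one value instead of checking two booleans that could contradict each other.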
- read() → collections.abc.Iterator[csv.DictReader]¶
A simple method of reading variants from a scoring file.
Returns a csv.DictReader, so each row is a variant in a dictionary.
No data validation is done. If you want validation, pass the returned dictionaries to the pydantic models (CatalogScoreVariants).
This method must be called with a context manager:
>>> from pgscatalog.core.lib._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
>>> sf = ScoringFile(testpath)
>>> with sf.read() as reader:
...     for variant in reader:
...         variant
...         break
{'rsID': 'rs10936599', 'chr_name': '3', 'chr_position': '170974795', 'effect_allele': 'T', 'other_allele': 'C', 'effect_weight': '0.123', 'allelefrequency_effect': '0.377', 'is_dominant': 'True', 'is_recessive': 'False', 'locus_name': 'MYNN', 'hm_source': 'ENSEMBL', 'hm_rsID': 'rs10936599', 'hm_chr': '3', 'hm_pos': '169492101', 'hm_inferOtherAllele': ''}
Calling this method directly isn’t helpful:
>>> sf.read()
<contextlib._GeneratorContextManager object ...>
Only local scoring files can be read (download them first):
>>> sf = ScoringFile("PGS001229")
>>> with sf.read() as f:
...     pass
Traceback (most recent call last):
...
FileNotFoundError: self.local_path=None: did you remember to .download()?
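The context-manager pattern behind read() can be sketched with the standard library alone. The function read_rows is hypothetical. The sketch assumes a tab-delimited file whose metadata lines start with "#", which is an assumption about the file layout and not something stated in this section.

```python
import contextlib
import csv
import tempfile
from pathlib import Path


@contextlib.contextmanager
def read_rows(path):
    """Yield a csv.DictReader over a tab-delimited file, skipping '#' lines (hypothetical helper)."""
    with open(path, newline="") as handle:
        data_lines = (line for line in handle if not line.startswith("#"))
        # The file stays open only while the caller's with-block is active
        yield csv.DictReader(data_lines, delimiter="\t")


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "custom.txt"
    path.write_text("#pgs_id=test\nrsID\teffect_allele\teffect_weight\nrs10936599\tT\t0.123\n")
    with read_rows(path) as reader:
        for row in reader:
            print(row)  # every value is still a string: nothing is validated
```

Because the function is a generator wrapped by contextlib.contextmanager, calling it without a with statement only builds the context manager object and opens nothing, which matches the unhelpful direct call shown above.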
- read_variants() → collections.abc.Generator[pgscatalog.core.lib.models.ScoreVariant, None, None]¶
Yields rows from a scoring file as ScoreVariants
ScoreVariants are pydantic models with data validation (PGS Catalog standards)
>>> from pgscatalog.core.lib._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
>>> sf = ScoringFile(testpath)
>>> variants = sf.read_variants()
>>> for i, variant in enumerate(variants):
...     variant
...     if i == 2:
...         break
ScoreVariant(rsID='rs10936599', chr_name='3', ...
ScoreVariant(rsID='rs6061231', chr_name='20', ...
ScoreVariant(rsID='rs10774214', chr_name='12', ...
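The difference between raw dictionary rows and validated variant objects can be shown with a reduced sketch. Here a standard-library dataclass stands in for the pydantic model. SketchVariant, iter_variants, and the two checks are hypothetical and far simpler than the PGS Catalog standards the real ScoreVariant enforces.

```python
from dataclasses import dataclass


@dataclass
class SketchVariant:
    """Hypothetical, much-reduced stand-in for ScoreVariant."""

    rsID: str
    effect_allele: str
    effect_weight: float

    def __post_init__(self):
        # Coerce the text value read from the file, then apply a simple check
        self.effect_weight = float(self.effect_weight)
        if not self.effect_allele:
            raise ValueError(f"{self.rsID}: missing effect allele")


def iter_variants(rows):
    """Lazily convert dictionary rows into validated SketchVariant objects."""
    for row in rows:
        yield SketchVariant(
            rsID=row["rsID"],
            effect_allele=row["effect_allele"],
            effect_weight=row["effect_weight"],
        )


rows = [
    {"rsID": "rs10936599", "effect_allele": "T", "effect_weight": "0.123"},
    # Made-up row for illustration only: the empty allele fails validation
    {"rsID": "rs_missing_allele", "effect_allele": "", "effect_weight": "0.5"},
]
variants = iter_variants(rows)
print(next(variants))  # the first row is converted and validated on demand
```

Since the generator is lazy, a malformed row only raises when iteration reaches it, so callers that stop early never pay for validating the rest of the file.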
- property genome_build: pgscatalog.core.lib.GenomeBuild¶
- property header¶
- property is_harmonised: bool¶
- property is_wide: bool¶
- property pgs_id: str¶
- property target_build: pgscatalog.core.lib.GenomeBuild¶
The GenomeBuild you want a ScoringFile to align to. Useful when using PGS Catalog accessions to instantiate this class.
- property variants: collections.abc.Generator[pgscatalog.core.lib.models.ScoreVariant, None, None]¶
A generator that yields rows from the scoring file as ScoreVariants, if a local file is available (i.e. after downloading). Always available for class instances that have a valid local path.
>>> from pgscatalog.core.lib._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
>>> sf = ScoringFile(testpath)
>>> for variant in sf.variants:
...     variant
...     break
ScoreVariant(rsID='rs10936599', chr_name='3', chr_position=170974795...
- class core.lib.scorefiles.ScoringFiles(*args, target_build=None, **kwargs)¶
This container class provides methods to work with multiple ScoringFile objects.
You can use publications or trait accessions to instantiate:
>>> from pgscatalog.core.lib import GenomeBuild
>>> ScoringFiles("PGP000001", target_build=GenomeBuild.GRCh37)
ScoringFiles('PGS000001', 'PGS000002', 'PGS000003', target_build=GenomeBuild.GRCh37)
Or multiple PGS IDs:
>>> ScoringFiles("PGS000001", "PGS000002")
ScoringFiles('PGS000001', 'PGS000002', target_build=None)
List input is OK too:
>>> ScoringFiles(["PGS000001", "PGS000002"])
ScoringFiles('PGS000001', 'PGS000002', target_build=None)
Or any mixture of publications, traits, and scores:
>>> ScoringFiles("PGP000001", "PGS000001", "PGS000002")
ScoringFiles('PGS000001', 'PGS000002', 'PGS000003', target_build=None)
Scoring files with duplicate PGS IDs (accessions) are automatically dropped. In the example above PGP000001 contains PGS000001, PGS000002, and PGS000003.
Traits can have children. To include scores for child traits, use the include_children parameter:
>>> score_with_children = ScoringFiles("MONDO_0004975", include_children=True)
>>> score_wo_children = ScoringFiles("MONDO_0004975", include_children=False)
>>> len(score_with_children) > len(score_wo_children)
True
For example, Alzheimer’s disease (MONDO_0004975) includes Late-onset Alzheimer’s disease (EFO_1001870) as a child trait.
Concatenation works as expected:
>>> ScoringFiles('PGS000001') + ScoringFiles('PGS000002', 'PGS000003')
ScoringFiles('PGS000001', 'PGS000002', 'PGS000003', target_build=None)
But only ScoringFiles with the same genome build can be concatenated:
>>> ScoringFiles('PGS000001') + ScoringFiles('PGS000002', 'PGS000003', target_build=GenomeBuild.GRCh38)
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for +: 'ScoringFiles' and 'ScoringFiles'
Multiplication isn’t supported, because ScoringFile elements must be unique:
>>> ScoringFiles('PGS000001') * 3
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for *: 'ScoringFiles' and 'int'
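The container behaviour described so far (duplicates dropped, concatenation only between matching builds, no multiplication) can be reproduced in a toy class. SketchScoringFiles is hypothetical and stores plain accession strings. Returning NotImplemented from __add__ is one plausible way to get the TypeError shown above, and whether the library does this is an assumption.

```python
class SketchScoringFiles:
    """Hypothetical container: unique accessions plus one shared target build."""

    def __init__(self, *accessions, target_build=None):
        # dict.fromkeys drops duplicates while keeping first-seen order
        self.elements = list(dict.fromkeys(accessions))
        self.target_build = target_build

    def __add__(self, other):
        if not isinstance(other, SketchScoringFiles):
            return NotImplemented
        if self.target_build != other.target_build:
            # NotImplemented makes Python raise TypeError for the "+" operator
            return NotImplemented
        return SketchScoringFiles(
            *self.elements, *other.elements, target_build=self.target_build
        )

    def __len__(self):
        return len(self.elements)

    def __getitem__(self, index):
        return self.elements[index]


merged = SketchScoringFiles("PGS000001", "PGS000002") + SketchScoringFiles("PGS000002", "PGS000003")
print(merged.elements)  # ['PGS000001', 'PGS000002', 'PGS000003']
```

Leaving __mul__ undefined is enough to make multiplication by an integer fail with a TypeError, so uniqueness never has to be re-checked after repetition.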
You can slice and iterate over ScoringFiles:
>>> score = ScoringFiles("PGP000001", target_build=GenomeBuild.GRCh38)
>>> score[0]
ScoringFile('PGS000001', target_build=GenomeBuild.GRCh38)
>>> for x in score:
...     x
ScoringFile('PGS000001', target_build=GenomeBuild.GRCh38)
ScoringFile('PGS000002', target_build=GenomeBuild.GRCh38)
ScoringFile('PGS000003', target_build=GenomeBuild.GRCh38)
>>> score[0] in score
True
The accession validation rules from ScoringFile also apply:
>>> ScoringFiles("PGPpotato")
Traceback (most recent call last):
...
pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: No Catalog result for accession 'PGPpotato'
Local files can also be used to instantiate ScoringFiles:
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     x = ScoringFile("PGS000001", target_build=GenomeBuild.GRCh38)
...     x.download(directory=d)
...     ScoringFiles(x.local_path)
ScoringFiles('.../PGS000001_hmPOS_GRCh38.txt.gz', target_build=None)
But the target_build parameter doesn’t work with local files:
>>> with tempfile.TemporaryDirectory() as d:
...     x = ScoringFile("PGS000002", target_build=GenomeBuild.GRCh38)
...     x.download(directory=d)
...     ScoringFiles(x.local_path, target_build=GenomeBuild.GRCh37)
Traceback (most recent call last):
...
ValueError: Can't load local scoring file when target_build is setTry .normalise() method to do liftover, or load harmonised scoring files from PGS Catalog
If you have a local scoring file that needs to change genome build, and using PGS Catalog harmonised data isn’t an option, you should make a ScoringFile from a path, then use the normalise() method with liftover enabled.
- property elements¶
Returns a list of ScoringFile objects contained inside ScoringFiles
- target_build = None¶
- core.lib.scorefiles.logger¶