core.lib.scorefiles
===================

.. py:module:: core.lib.scorefiles

.. autoapi-nested-parse::

   This module contains classes to compose and contain a ``ScoringFile``: a file
   in the PGS Catalog that contains a list of genetic variants and their effect
   weights. Scoring files are used to calculate PGS for new target genomes.


Attributes
----------

.. autoapisummary::

   core.lib.scorefiles.logger


Classes
-------

.. autoapisummary::

   core.lib.scorefiles.ScoringFile
   core.lib.scorefiles.ScoringFiles


Module Contents
---------------

.. py:class:: ScoringFile(identifier, target_build=None, query_result=None, **kwargs)

   Represents a single scoring file in the PGS Catalog.

   :param identifier: A PGS Catalog score accession in the format ``PGS123456`` or a path to a local scoring file
   :param target_build: An optional :class:`GenomeBuild`, which represents the build you want the scoring file to align to
   :param query_result: An optional :class:`ScoreQueryResult`; if provided with an accession identifier, it prevents hitting the PGS Catalog API
   :raises pgscatalog.corelib.InvalidAccessionError: If the PGS Catalog API can't find the provided accession
   :raises pgscatalog.corelib.ScoreFormatError: If you try to iterate over a ``ScoringFile`` without a local path (before downloading it)

   You can make a ``ScoringFile`` with a path to a scoring file with minimal metadata:

   >>> from pgscatalog.core.lib.genomebuild import GenomeBuild
   >>> from pgscatalog.core.lib._config import Config
   >>> sf = ScoringFile(Config.ROOT_DIR / "tests" / "data" / "custom.txt")
   >>> sf # doctest: +ELLIPSIS
   ScoringFile('.../custom.txt', target_build=None)
   >>> sf.header
   ScoreHeader(pgs_id='test', pgs_name='test', trait_reported='test trait', genome_build=GenomeBuild.GRCh37)
   >>> sf.is_harmonised
   False

   Scoring file from OmicsPred:

   >>> from pgscatalog.core.lib.genomebuild import GenomeBuild
   >>> from pgscatalog.core.lib._config import Config
   >>> sf = ScoringFile(Config.ROOT_DIR / "tests" / "data" / "OPGS002493.txt.gz")
   >>> sf
   ScoringFile('.../OPGS002493.txt.gz', target_build=None)
   >>> sf.header
   ScoreHeader(pgs_id='OPGS002493', pgs_name='P80162', trait_reported='C-X-C motif chemokine 6', genome_build=GenomeBuild.GRCh37)
   >>> sf.is_harmonised
   False
   >>> for variant in sf.variants:
   ...     variant
   ...     break
   ScoreVariant(rsID='rs75288020', chr_name='4', chr_position=74151217, effect_allele=Allele(allele='A', is_snp=True)...

   Also supports PGS Catalog header metadata:

   >>> sf = ScoringFile(Config.ROOT_DIR / "tests" / "data" / "PGS000001_hmPOS_GRCh38.txt.gz")
   >>> sf # doctest: +ELLIPSIS
   ScoringFile('.../PGS000001_hmPOS_GRCh38.txt.gz', target_build=None)
   >>> sf.header
   CatalogScoreHeader(pgs_id='PGS000001', pgs_name='PRS77_BC', trait_reported='Breast cancer', genome_build=None, format_version=, trait_mapped=['breast carcinoma'], trait_efo=['EFO_0000305'], variants_number=77, weight_type=None, pgp_id='PGP000001', citation='Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036', HmPOS_build=GenomeBuild.GRCh38, HmPOS_date=datetime.date(2022, 7, 29), HmPOS_match_pos='{"True": null, "False": null}', HmPOS_match_chr='{"True": null, "False": null}')

   Looking at the header above, the original submission lacked a genome build but has been harmonised:

   >>> sf.is_harmonised
   True
   >>> sf.genome_build
   GenomeBuild.GRCh38
   >>> sf.pgs_id
   'PGS000001'
   >>> for variant in sf.variants: # doctest: +ELLIPSIS
   ...     variant
   ...     break
   ScoreVariant(rsID='rs78540526', chr_name='11', chr_position=None, effect_allele=Allele(allele='T', is_snp=True)...

   You can also make a ``ScoringFile`` by using PGS Catalog score accessions:

   >>> sf = ScoringFile("PGS000001", target_build=GenomeBuild.GRCh38)
   >>> sf
   ScoringFile('PGS000001', target_build=GenomeBuild.GRCh38)

   It's important to use the ``.download()`` method when you're not working with local files, or many attributes and methods will be missing or won't work:

   >>> for variant in sf.variants:
   ...     variant
   ...     break
   Traceback (most recent call last):
   ...
   FileNotFoundError: self.local_path=None: did you remember to .download()?

   A ``ScoringFile`` can also be constructed with a ``ScoreQueryResult`` if you want to be polite to the PGS Catalog API. Just add the ``query_result`` parameter:

   >>> score_query_result = sf.catalog_response # extract score query from old query
   >>> ScoringFile(identifier=sf.pgs_id, query_result=sf.catalog_response) # doesn't hit the PGS Catalog API again
   ScoringFile('PGS000001', target_build=None)

   :class:`InvalidAccessionError` is raised if you provide bad identifiers:

   >>> import tempfile
   >>> with tempfile.TemporaryDirectory() as tmp_dir:
   ...     ScoringFile("potato", GenomeBuild.GRCh38).download(tmp_dir)
   Traceback (most recent call last):
   ...
   pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: Invalid accession: 'potato'

   The same exception is raised if you provide a well formatted identifier that doesn't exist:

   >>> with tempfile.TemporaryDirectory() as tmp_dir:
   ...     ScoringFile("PGS000000", GenomeBuild.GRCh38).download(tmp_dir)
   Traceback (most recent call last):
   ...
   pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: No Catalog result for accession 'PGS000000'

   .. py:method:: download(directory, overwrite=False)

      Download a ScoringFile to a specified directory with checksum validation

      :param directory: Directory to write file to
      :param overwrite: Overwrite existing file if present
      :raises pgscatalog.corelib.ScoreDownloadError: If there's an unrecoverable problem downloading the file
      :raises pgscatalog.corelib.ScoreChecksumError: If md5 validation consistently fails
      :returns: None

      >>> import tempfile, os
      >>> from pgscatalog.core.lib import GenomeBuild
      >>> with tempfile.TemporaryDirectory() as tmp_dir:
      ...     ScoringFile("PGS000001").download(tmp_dir)
      ...     print(os.listdir(tmp_dir))
      ['PGS000001.txt.gz']

      It's possible to request a scoring file in a specific genome build:

      >>> import tempfile, os
      >>> with tempfile.TemporaryDirectory() as tmp_dir:
      ...     ScoringFile("PGS000001", GenomeBuild.GRCh38).download(tmp_dir)
      ...     print(os.listdir(tmp_dir))
      ['PGS000001_hmPOS_GRCh38.txt.gz']

   .. py:method:: normalise(liftover=False, drop_missing=False, chain_dir=None, target_build=None)

      Extracts key fields from a scoring file in a normalised format.

      Takes care of quality control.

      >>> from pgscatalog.core.lib import GenomeBuild
      >>> from pgscatalog.core.lib._config import Config
      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000001_hmPOS_GRCh38.txt.gz"
      >>> variants = ScoringFile(testpath).normalise()
      >>> for x in variants: # doctest: +ELLIPSIS
      ...     x
      ...     break
      ScoreVariant(rsID='rs78540526', chr_name='11', chr_position=69516650, effect_allele=Allele(allele='T', is_snp=True), ...

      Supports lifting over scoring files from GRCh37 to GRCh38:

      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "lift_to_grch38.txt"
      >>> chaindir = Config.ROOT_DIR / "tests" / "data" / "chain"
      >>> sf = ScoringFile(testpath)
      >>> variants = sf.normalise(liftover=True, chain_dir=chaindir, target_build=GenomeBuild.GRCh38)
      >>> for x in variants:
      ...     (x.rsID, x.chr_name, x.chr_position)
      ...     break
      ('rs78540526', '11', 69516650)

      Example of lifting down (GRCh38 to GRCh37):

      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "lift_to_grch37.txt"
      >>> chaindir = Config.ROOT_DIR / "tests" / "data" / "chain"
      >>> sf = ScoringFile(testpath)
      >>> variants = sf.normalise(liftover=True, chain_dir=chaindir, target_build=GenomeBuild.GRCh37)
      >>> for x in variants:
      ...     (x.rsID, x.chr_name, x.chr_position)
      ...     break
      ('rs78540526', '11', 69331418)

      Liftover support is only really useful for custom scoring files that aren't in the PGS Catalog. It's always best to use harmonised data when it's available from the PGS Catalog. Harmonised data goes through a lot of validation and error checking.

      A :class:`LiftoverError` is only raised when many converted coordinates are missing.
      Normalising converts the is_dominant and is_recessive optional fields in scoring files into an EffectType:

      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
      >>> variants = ScoringFile(testpath).normalise()
      >>> for i, x in enumerate(variants): # doctest: +ELLIPSIS
      ...     (x.is_dominant, x.is_recessive, x.effect_type)
      ...     if i == 2:
      ...         break
      (True, False, EffectType.DOMINANT)
      (False, True, EffectType.RECESSIVE)
      (True, False, EffectType.DOMINANT)

   .. py:method:: read() -> collections.abc.Iterator[csv.DictReader]

      A simple method of reading variants from a scoring file.

      Returns a csv.DictReader, so each row is a variant in a dictionary. No data validation is done. If you want validation, combine the returned dictionaries with the pydantic models (CatalogScoreVariants).

      This method must be called with a context manager:

      >>> from pgscatalog.core.lib._config import Config
      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
      >>> sf = ScoringFile(testpath)
      >>> with sf.read() as reader:
      ...     for variant in reader:
      ...         variant
      ...         break
      {'rsID': 'rs10936599', 'chr_name': '3', 'chr_position': '170974795', 'effect_allele': 'T', 'other_allele': 'C', 'effect_weight': '0.123', 'allelefrequency_effect': '0.377', 'is_dominant': 'True', 'is_recessive': 'False', 'locus_name': 'MYNN', 'hm_source': 'ENSEMBL', 'hm_rsID': 'rs10936599', 'hm_chr': '3', 'hm_pos': '169492101', 'hm_inferOtherAllele': ''}

      Calling this method directly isn't helpful:

      >>> sf.read()

      Only local scoring files can be read (download them first):

      >>> sf = ScoringFile("PGS001229")
      >>> with sf.read() as f:
      ...     pass
      Traceback (most recent call last):
      ...
      FileNotFoundError: self.local_path=None: did you remember to .download()?

   .. py:method:: read_variants() -> collections.abc.Generator[pgscatalog.core.lib.models.ScoreVariant, None, None]

      Yields rows from a scoring file as ScoreVariants

      ScoreVariants are pydantic models with data validation (PGS Catalog standards)

      >>> from pgscatalog.core.lib._config import Config
      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
      >>> sf = ScoringFile(testpath)
      >>> variants = sf.read_variants()
      >>> for i, variant in enumerate(variants):
      ...     variant
      ...     if i == 2:
      ...         break
      ScoreVariant(rsID='rs10936599', chr_name='3', ...
      ScoreVariant(rsID='rs6061231', chr_name='20', ...
      ScoreVariant(rsID='rs10774214', chr_name='12', ...

   .. py:property:: genome_build
      :type: pgscatalog.core.lib.GenomeBuild

   .. py:property:: header

   .. py:property:: is_harmonised
      :type: bool

   .. py:property:: is_wide
      :type: bool

   .. py:property:: pgs_id
      :type: str

   .. py:property:: target_build
      :type: pgscatalog.core.lib.GenomeBuild

      The ``GenomeBuild`` you want a ``ScoringFile`` to align to. Useful when using PGS Catalog accessions to instantiate this class.

   .. py:property:: variants
      :type: collections.abc.Generator[pgscatalog.core.lib.models.ScoreVariant, None, None]

      A generator that yields rows from the scoring file as ``ScoreVariants``, if a local file is available (i.e. after downloading). Always available for class instances that have a valid local path.

      >>> from pgscatalog.core.lib._config import Config
      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
      >>> sf = ScoringFile(testpath)
      >>> for variant in sf.variants:
      ...     variant
      ...     break
      ScoreVariant(rsID='rs10936599', chr_name='3', chr_position=170974795...

.. py:class:: ScoringFiles(*args, target_build=None, **kwargs)

   This container class provides methods to work with multiple ScoringFile objects.
   You can use publications or trait accessions to instantiate:

   >>> from pgscatalog.core.lib import GenomeBuild
   >>> ScoringFiles("PGP000001", target_build=GenomeBuild.GRCh37)
   ScoringFiles('PGS000001', 'PGS000002', 'PGS000003', target_build=GenomeBuild.GRCh37)

   Or multiple PGS IDs:

   >>> ScoringFiles("PGS000001", "PGS000002")
   ScoringFiles('PGS000001', 'PGS000002', target_build=None)

   List input is OK too:

   >>> ScoringFiles(["PGS000001", "PGS000002"])
   ScoringFiles('PGS000001', 'PGS000002', target_build=None)

   Or any mixture of publications, traits, and scores:

   >>> ScoringFiles("PGP000001", "PGS000001", "PGS000002")
   ScoringFiles('PGS000001', 'PGS000002', 'PGS000003', target_build=None)

   Scoring files with duplicate PGS IDs (accessions) are automatically dropped. In the example above ``PGP000001`` contains ``PGS000001``, ``PGS000002``, and ``PGS000003``.

   Traits can have children. To include child traits, use the ``include_children`` parameter:

   >>> score_with_children = ScoringFiles("MONDO_0004975", include_children=True)
   >>> score_wo_children = ScoringFiles("MONDO_0004975", include_children=False)
   >>> len(score_with_children) > len(score_wo_children)
   True

   For example, Alzheimer's disease (``MONDO_0004975``) includes Late-onset Alzheimer's disease (``EFO_1001870``) as a child trait.

   Concatenation works as expected:

   >>> ScoringFiles('PGS000001') + ScoringFiles('PGS000002', 'PGS000003')
   ScoringFiles('PGS000001', 'PGS000002', 'PGS000003', target_build=None)

   But only :class:`ScoringFiles` with the same genome build can be concatenated:

   >>> ScoringFiles('PGS000001') + ScoringFiles('PGS000002', 'PGS000003', target_build=GenomeBuild.GRCh38)
   Traceback (most recent call last):
   ...
   TypeError: unsupported operand type(s) for +: 'ScoringFiles' and 'ScoringFiles'

   Multiplication doesn't make sense because :class:`ScoringFile` elements must be unique, so it isn't supported:

   >>> ScoringFiles('PGS000001') * 3
   Traceback (most recent call last):
   ...
   TypeError: unsupported operand type(s) for *: 'ScoringFiles' and 'int'

   You can slice and iterate over :class:`ScoringFiles`:

   >>> score = ScoringFiles("PGP000001", target_build=GenomeBuild.GRCh38)
   >>> score[0]
   ScoringFile('PGS000001', target_build=GenomeBuild.GRCh38)
   >>> for x in score:
   ...     x
   ScoringFile('PGS000001', target_build=GenomeBuild.GRCh38)
   ScoringFile('PGS000002', target_build=GenomeBuild.GRCh38)
   ScoringFile('PGS000003', target_build=GenomeBuild.GRCh38)
   >>> score[0] in score
   True

   The accession validation rules from :class:`ScoringFile` also apply here:

   >>> ScoringFiles("PGPpotato")
   Traceback (most recent call last):
   ...
   pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: No Catalog result for accession 'PGPpotato'

   Local files can also be used to instantiate :class:`ScoringFiles`:

   >>> import tempfile
   >>> with tempfile.TemporaryDirectory() as d:
   ...     x = ScoringFile("PGS000001", target_build=GenomeBuild.GRCh38)
   ...     x.download(directory=d)
   ...     ScoringFiles(x.local_path) # doctest: +ELLIPSIS
   ScoringFiles('.../PGS000001_hmPOS_GRCh38.txt.gz', target_build=None)

   But the ``target_build`` parameter doesn't work with local files:

   >>> with tempfile.TemporaryDirectory() as d:
   ...     x = ScoringFile("PGS000002", target_build=GenomeBuild.GRCh38)
   ...     x.download(directory=d)
   ...     ScoringFiles(x.local_path, target_build=GenomeBuild.GRCh37)
   Traceback (most recent call last):
   ...
   ValueError: Can't load local scoring file when target_build is setTry .normalise() method to do liftover, or load harmonised scoring files from PGS Catalog

   If you have a local scoring file that needs to change genome build, and using PGS Catalog harmonised data isn't an option, you should make a :class:`ScoringFile` from a path, then use the ``normalise()`` method with liftover enabled.

   .. py:property:: elements

      Returns a list of :class:`ScoringFile` objects contained inside :class:`ScoringFiles`

   .. py:attribute:: target_build
      :value: None

.. py:data:: logger
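
The container behaviour documented for ``ScoringFiles`` above (duplicate accessions dropped, concatenation limited to a shared target build, no multiplication) can be illustrated with a small standard-library sketch. This is not the library's implementation: ``AccessionCollection`` is a hypothetical stand-in that stores plain accession strings, and builds are represented as plain strings rather than ``GenomeBuild`` members.

```python
from __future__ import annotations

from typing import Iterator, Optional


class AccessionCollection:
    """Hypothetical stand-in for a ScoringFiles-like container (illustration only)."""

    def __init__(self, *accessions: str, target_build: Optional[str] = None) -> None:
        # dict.fromkeys drops duplicates while keeping first-seen order
        self._elements = list(dict.fromkeys(accessions))
        self.target_build = target_build

    def __repr__(self) -> str:
        parts = [repr(x) for x in self._elements]
        parts.append(f"target_build={self.target_build!r}")
        return f"{type(self).__name__}({', '.join(parts)})"

    def __len__(self) -> int:
        return len(self._elements)

    def __getitem__(self, index: int) -> str:
        return self._elements[index]

    def __iter__(self) -> Iterator[str]:
        return iter(self._elements)

    def __contains__(self, item: object) -> bool:
        return item in self._elements

    def __add__(self, other: object) -> "AccessionCollection":
        # Returning NotImplemented makes Python raise TypeError, which is how
        # concatenation across different target builds is rejected here.
        if not isinstance(other, AccessionCollection):
            return NotImplemented
        if other.target_build != self.target_build:
            return NotImplemented
        return AccessionCollection(
            *self._elements, *other._elements, target_build=self.target_build
        )

    # __mul__ is deliberately not defined: repeating elements would break
    # uniqueness, so `collection * 3` raises TypeError.


merged = AccessionCollection("PGS000001") + AccessionCollection("PGS000002", "PGS000001")
print(merged)  # AccessionCollection('PGS000001', 'PGS000002', target_build=None)
```

Returning ``NotImplemented`` from ``__add__`` instead of raising directly lets Python produce the standard ``unsupported operand type(s)`` message, which matches the shape of the tracebacks shown in the ``ScoringFiles`` examples.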