core.lib.scorefiles
===================

.. py:module:: core.lib.scorefiles

.. autoapi-nested-parse::

   This module contains classes to compose and contain a ``ScoringFile``: a file
   in the PGS Catalog that contains a list of genetic variants and their effect weights.
   Scoring files are used to calculate PGS for new target genomes.
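
As a rough illustration of how effect weights are used, an additive polygenic score is commonly computed as the sum, over variants, of the effect weight multiplied by the number of effect alleles carried. The sketch below uses plain Python only. The variant IDs, weights, and dosages are made-up values, and ``polygenic_score`` is a hypothetical helper that is not part of this module:

```python
# Hypothetical scoring data: variant ID -> effect weight (made-up numbers).
effect_weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 1.0}

# Hypothetical target genome: variant ID -> effect allele dosage (0, 1, or 2).
dosages = {"rs1": 2, "rs2": 1, "rs3": 0}


def polygenic_score(weights, dosages):
    """Sum dosage * weight over variants; variants absent from the genome count as 0."""
    return sum(weight * dosages.get(variant, 0) for variant, weight in weights.items())


score = polygenic_score(effect_weights, dosages)
print(score)
```

Matching variants between a scoring file and a target genome is a separate step that this sketch skips.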


Attributes
----------

.. autoapisummary::

   core.lib.scorefiles.logger


Classes
-------

.. autoapisummary::

   core.lib.scorefiles.ScoringFile
   core.lib.scorefiles.ScoringFiles


Module Contents
---------------

.. py:class:: ScoringFile(identifier, target_build=None, query_result=None, **kwargs)

   Represents a single scoring file in the PGS Catalog.

   :param identifier: A PGS Catalog score accession in the format ``PGS123456`` or a path to a local scoring file
   :param target_build: An optional :class:`GenomeBuild`, which represents the build you want the scoring file to align to
   :param query_result: An optional :class:`ScoreQueryResult`. When provided together with an accession identifier, it avoids a query to the PGS Catalog API
   :raises pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: If the PGS Catalog API can't find the provided accession
   :raises pgscatalog.corelib.ScoreFormatError: If you try to iterate over a ``ScoringFile`` without a local path (before downloading it)

   You can make a ``ScoringFile`` from a path to a scoring file with minimal metadata:

   >>> from pgscatalog.core.lib.genomebuild import GenomeBuild
   >>> from pgscatalog.core.lib._config import Config
   >>> sf = ScoringFile(Config.ROOT_DIR / "tests" / "data" / "custom.txt")
   >>> sf # doctest: +ELLIPSIS
   ScoringFile('.../custom.txt', target_build=None)
   >>> sf.header
   ScoreHeader(pgs_id='test', pgs_name='test', trait_reported='test trait', genome_build=GenomeBuild.GRCh37)
   >>> sf.is_harmonised
   False

   Scoring files from OmicsPred are also supported:

   >>> from pgscatalog.core.lib.genomebuild import GenomeBuild
   >>> from pgscatalog.core.lib._config import Config
   >>> sf = ScoringFile(Config.ROOT_DIR / "tests" / "data" / "OPGS002493.txt.gz")
   >>> sf # doctest: +ELLIPSIS
   ScoringFile('.../OPGS002493.txt.gz', target_build=None)
   >>> sf.header
   ScoreHeader(pgs_id='OPGS002493', pgs_name='P80162', trait_reported='C-X-C motif chemokine 6', genome_build=GenomeBuild.GRCh37)
   >>> sf.is_harmonised
   False
   >>> for variant in sf.variants: # doctest: +ELLIPSIS
   ...     variant
   ...     break
   ScoreVariant(rsID='rs75288020', chr_name='4', chr_position=74151217, effect_allele=Allele(allele='A', is_snp=True)...

   PGS Catalog header metadata is also supported:

   >>> sf = ScoringFile(Config.ROOT_DIR / "tests" / "data" / "PGS000001_hmPOS_GRCh38.txt.gz")
   >>> sf # doctest: +ELLIPSIS
   ScoringFile('.../PGS000001_hmPOS_GRCh38.txt.gz', target_build=None)
   >>> sf.header
   CatalogScoreHeader(pgs_id='PGS000001', pgs_name='PRS77_BC', trait_reported='Breast cancer', genome_build=None, format_version=<ScoreFormatVersion.v2: '2.0'>, trait_mapped=['breast carcinoma'], trait_efo=['EFO_0000305'], variants_number=77, weight_type=None, pgp_id='PGP000001', citation='Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036', HmPOS_build=GenomeBuild.GRCh38, HmPOS_date=datetime.date(2022, 7, 29), HmPOS_match_pos='{"True": null, "False": null}', HmPOS_match_chr='{"True": null, "False": null}')

   Looking at the header above, the original submission lacked a genome build but has been harmonised:

   >>> sf.is_harmonised
   True

   >>> sf.genome_build
   GenomeBuild.GRCh38

   >>> sf.pgs_id
   'PGS000001'

   >>> for variant in sf.variants: # doctest: +ELLIPSIS
   ...     variant
   ...     break
   ScoreVariant(rsID='rs78540526', chr_name='11', chr_position=None, effect_allele=Allele(allele='T', is_snp=True)...

   You can also make a ``ScoringFile`` by using PGS Catalog score accessions:

   >>> sf = ScoringFile("PGS000001", target_build=GenomeBuild.GRCh38)
   >>> sf
   ScoringFile('PGS000001', target_build=GenomeBuild.GRCh38)

   It's important to use the ``.download()`` method when you're not working with local files,
   or many attributes and methods will be missing or won't work:

   >>> for variant in sf.variants:
   ...     variant
   ...     break
   Traceback (most recent call last):
   ...
   FileNotFoundError: self.local_path=None: did you remember to .download()?

   A ``ScoringFile`` can also be constructed with a ``ScoreQueryResult`` if you want
   to be polite to the PGS Catalog API. Just add the ``query_result`` parameter:

   >>> score_query_result = sf.catalog_response  # reuse the response from the earlier query
   >>> ScoringFile(identifier=sf.pgs_id, query_result=score_query_result)  # doesn't hit the PGS Catalog API again
   ScoringFile('PGS000001', target_build=None)

   :class:`InvalidAccessionError` is raised if you provide bad identifiers:

   >>> import tempfile
   >>> with tempfile.TemporaryDirectory() as tmp_dir:
   ...     ScoringFile("potato", GenomeBuild.GRCh38).download(tmp_dir)
   Traceback (most recent call last):
   ...
   pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: Invalid accession: 'potato'

   The same exception is raised if you provide a well-formatted identifier that doesn't exist:

   >>> with tempfile.TemporaryDirectory() as tmp_dir:
   ...     ScoringFile("PGS000000", GenomeBuild.GRCh38).download(tmp_dir)
   Traceback (most recent call last):
   ...
   pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: No Catalog result for accession 'PGS000000'


   .. py:method:: download(directory, overwrite=False)

      Download a ``ScoringFile`` to the specified directory and validate its checksum.

      :param directory: Directory to write file to
      :param overwrite: Overwrite existing file if present

      :raises pgscatalog.corelib.ScoreDownloadError: If there's an unrecoverable problem downloading the file
      :raises pgscatalog.corelib.ScoreChecksumError: If MD5 validation consistently fails

      :returns: None

      >>> import tempfile, os
      >>> from pgscatalog.core.lib import GenomeBuild

      >>> with tempfile.TemporaryDirectory() as tmp_dir:
      ...     ScoringFile("PGS000001").download(tmp_dir)
      ...     print(os.listdir(tmp_dir))
      ['PGS000001.txt.gz']

      It's possible to request a scoring file in a specific genome build:

      >>> import tempfile, os
      >>> with tempfile.TemporaryDirectory() as tmp_dir:
      ...     ScoringFile("PGS000001", GenomeBuild.GRCh38).download(tmp_dir)
      ...     print(os.listdir(tmp_dir))
      ['PGS000001_hmPOS_GRCh38.txt.gz']


   .. py:method:: normalise(liftover=False, drop_missing=False, chain_dir=None, target_build=None)

      Extracts key fields from a scoring file in a normalised format.

      Takes care of quality control.

      >>> from pgscatalog.core.lib import GenomeBuild
      >>> from pgscatalog.core.lib._config import Config
      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000001_hmPOS_GRCh38.txt.gz"
      >>> variants = ScoringFile(testpath).normalise()
      >>> for x in variants: # doctest: +ELLIPSIS
      ...     x
      ...     break
      ScoreVariant(rsID='rs78540526', chr_name='11', chr_position=69516650, effect_allele=Allele(allele='T', is_snp=True), ...

      Supports lifting over scoring files from GRCh37 to GRCh38:

      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "lift_to_grch38.txt"
      >>> chaindir = Config.ROOT_DIR / "tests" / "data" / "chain"
      >>> sf = ScoringFile(testpath)
      >>> variants = sf.normalise(liftover=True, chain_dir=chaindir, target_build=GenomeBuild.GRCh38)
      >>> for x in variants:
      ...     (x.rsID, x.chr_name, x.chr_position)
      ...     break
      ('rs78540526', '11', 69516650)

      Lifting down from GRCh38 to GRCh37 is also supported:

      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "lift_to_grch37.txt"
      >>> chaindir = Config.ROOT_DIR / "tests" / "data" / "chain"
      >>> sf = ScoringFile(testpath)
      >>> variants = sf.normalise(liftover=True, chain_dir=chaindir, target_build=GenomeBuild.GRCh37)
      >>> for x in variants:
      ...     (x.rsID, x.chr_name, x.chr_position)
      ...     break
      ('rs78540526', '11', 69331418)

      Liftover support is only really useful for custom scoring files that aren't
      in the PGS Catalog. It's always best to use harmonised data when it's
      available from the PGS Catalog. Harmonised data goes through a lot of validation
      and error checking.

      A :class:`LiftoverError` is only raised when many converted coordinates are missing.

      Normalising converts the optional ``is_dominant`` and ``is_recessive`` fields in
      scoring files into an ``EffectType``:

      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
      >>> variants = ScoringFile(testpath).normalise()
      >>> for i, x in enumerate(variants): # doctest: +ELLIPSIS
      ...     (x.is_dominant, x.is_recessive, x.effect_type)
      ...     if i == 2:
      ...         break
      (True, False, EffectType.DOMINANT)
      (False, True, EffectType.RECESSIVE)
      (True, False, EffectType.DOMINANT)


   .. py:method:: read() -> collections.abc.Iterator[csv.DictReader]

      A simple method of reading variants from a scoring file.

      Returns a ``csv.DictReader``, so each row is a variant represented as a dictionary.

      No data validation is done. If you need validation, pass the returned dictionaries to the pydantic models (``CatalogScoreVariants``).

      This method must be called with a context manager:

      >>> from pgscatalog.core.lib._config import Config
      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
      >>> sf = ScoringFile(testpath)
      >>> with sf.read() as reader:
      ...     for variant in reader:
      ...         variant
      ...         break
      {'rsID': 'rs10936599', 'chr_name': '3', 'chr_position': '170974795', 'effect_allele': 'T', 'other_allele': 'C', 'effect_weight': '0.123', 'allelefrequency_effect': '0.377', 'is_dominant': 'True', 'is_recessive': 'False', 'locus_name': 'MYNN', 'hm_source': 'ENSEMBL', 'hm_rsID': 'rs10936599', 'hm_chr': '3', 'hm_pos': '169492101', 'hm_inferOtherAllele': ''}

      Calling this method directly isn't helpful:

      >>> sf.read() # doctest: +ELLIPSIS
      <contextlib._GeneratorContextManager object ...>

      Only local scoring files can be read (download them first):

      >>> sf = ScoringFile("PGS001229")
      >>> with sf.read() as f:
      ...     pass
      Traceback (most recent call last):
      ...
      FileNotFoundError: self.local_path=None: did you remember to .download()?


   .. py:method:: read_variants() -> collections.abc.Generator[pgscatalog.core.lib.models.ScoreVariant, None, None]

      Yields rows from a scoring file as ``ScoreVariant`` objects.

      ``ScoreVariant`` objects are pydantic models with data validation (PGS Catalog standards).

      >>> from pgscatalog.core.lib._config import Config
      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
      >>> sf = ScoringFile(testpath)
      >>> variants = sf.read_variants()
      >>> for i, variant in enumerate(variants): # doctest: +ELLIPSIS
      ...     variant
      ...     if i == 2:
      ...         break
      ScoreVariant(rsID='rs10936599', chr_name='3', ...
      ScoreVariant(rsID='rs6061231', chr_name='20', ...
      ScoreVariant(rsID='rs10774214', chr_name='12', ...


   .. py:property:: genome_build
      :type: pgscatalog.core.lib.GenomeBuild


   .. py:property:: header


   .. py:property:: is_harmonised
      :type: bool


   .. py:property:: is_wide
      :type: bool


   .. py:property:: pgs_id
      :type: str


   .. py:property:: target_build
      :type: pgscatalog.core.lib.GenomeBuild


      The ``GenomeBuild`` you want a ``ScoringFile`` to align to. Useful when using PGS
      Catalog accessions to instantiate this class.


   .. py:property:: variants
      :type: collections.abc.Generator[pgscatalog.core.lib.models.ScoreVariant, None, None]


      A generator that yields rows from the scoring file as ``ScoreVariant`` objects.
      It is only available for instances that have a valid local path, for example
      after downloading.

      >>> from pgscatalog.core.lib._config import Config
      >>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000802_hmPOS_GRCh37.txt"
      >>> sf = ScoringFile(testpath)
      >>> for variant in sf.variants: # doctest: +ELLIPSIS
      ...     variant
      ...     break
      ScoreVariant(rsID='rs10936599', chr_name='3', chr_position=170974795...
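
The ``_GeneratorContextManager`` shown in the ``ScoringFile.read()`` example above is what functions decorated with ``contextlib.contextmanager`` return, which is why the method has to be used in a ``with`` block. The sketch below shows that general pattern with the standard library only. ``read_rows`` and the file contents are hypothetical, and skipping header lines that start with ``#`` is an assumption about the file layout, not a description of how ``read()`` is implemented:

```python
import contextlib
import csv
import gzip
import os
import tempfile


@contextlib.contextmanager
def read_rows(path):
    """Yield a csv.DictReader over a tab-separated file, skipping '#' header lines."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, mode="rt", newline="") as handle:
        data_lines = (line for line in handle if not line.startswith("#"))
        yield csv.DictReader(data_lines, delimiter="\t")
    # The file is closed here, once the caller leaves the ``with`` block.


# Write a tiny made-up scoring file to a temporary directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")
    with open(path, "w", newline="") as f:
        f.write("#pgs_id=example\n")
        f.write("rsID\teffect_allele\teffect_weight\n")
        f.write("rs1\tA\t0.5\n")

    with read_rows(path) as reader:
        rows = list(reader)

print(rows)
```

As in the ``read()`` example, every value comes back as a string because nothing is validated or converted.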


.. py:class:: ScoringFiles(*args, target_build=None, **kwargs)

   This container class provides methods to work with multiple ScoringFile objects.

   You can use publication or trait accessions to instantiate it:

   >>> from pgscatalog.core.lib import GenomeBuild
   >>> ScoringFiles("PGP000001", target_build=GenomeBuild.GRCh37)
   ScoringFiles('PGS000001', 'PGS000002', 'PGS000003', target_build=GenomeBuild.GRCh37)

   Or multiple PGS IDs:

   >>> ScoringFiles("PGS000001", "PGS000002")
   ScoringFiles('PGS000001', 'PGS000002', target_build=None)

   List input is OK too:

   >>> ScoringFiles(["PGS000001", "PGS000002"])
   ScoringFiles('PGS000001', 'PGS000002', target_build=None)

   Or any mixture of publications, traits, and scores:

   >>> ScoringFiles("PGP000001", "PGS000001", "PGS000002")
   ScoringFiles('PGS000001', 'PGS000002', 'PGS000003', target_build=None)

   Scoring files with duplicate PGS IDs (accessions) are automatically dropped.
   In the example above ``PGP000001`` contains ``PGS000001``, ``PGS000002``, and ``PGS000003``.

   Traits can have children. To include these traits, use the ``include_children`` parameter:

   >>> score_with_children = ScoringFiles("MONDO_0004975", include_children=True)
   >>> score_wo_children = ScoringFiles("MONDO_0004975", include_children=False)
   >>> len(score_with_children) > len(score_wo_children)
   True

   For example, Alzheimer's disease (``MONDO_0004975``) includes Late-onset Alzheimer's disease (``EFO_1001870``) as a child trait.

   Concatenation works as expected:

   >>> ScoringFiles('PGS000001') + ScoringFiles('PGS000002', 'PGS000003')
   ScoringFiles('PGS000001', 'PGS000002', 'PGS000003', target_build=None)

   But only :class:`ScoringFiles` with the same genome build can be concatenated:

   >>> ScoringFiles('PGS000001') + ScoringFiles('PGS000002', 'PGS000003', target_build=GenomeBuild.GRCh38)
   Traceback (most recent call last):
   ...
   TypeError: unsupported operand type(s) for +: 'ScoringFiles' and 'ScoringFiles'

   Multiplication isn't supported, because :class:`ScoringFile` elements must be unique:

   >>> ScoringFiles('PGS000001') * 3
   Traceback (most recent call last):
   ...
   TypeError: unsupported operand type(s) for *: 'ScoringFiles' and 'int'

   You can slice and iterate over :class:`ScoringFiles`:

   >>> score = ScoringFiles("PGP000001", target_build=GenomeBuild.GRCh38)
   >>> score[0]
   ScoringFile('PGS000001', target_build=GenomeBuild.GRCh38)
   >>> for x in score:
   ...     x
   ScoringFile('PGS000001', target_build=GenomeBuild.GRCh38)
   ScoringFile('PGS000002', target_build=GenomeBuild.GRCh38)
   ScoringFile('PGS000003', target_build=GenomeBuild.GRCh38)
   >>> score[0] in score
   True

   The accession validation rules from :class:`ScoringFile` also apply:

   >>> ScoringFiles("PGPpotato")
   Traceback (most recent call last):
   ...
   pgscatalog.core.lib.pgsexceptions.InvalidAccessionError: No Catalog result for accession 'PGPpotato'

   Local files can also be used to instantiate :class:`ScoringFiles`:

   >>> import tempfile
   >>> with tempfile.TemporaryDirectory() as d:
   ...     x = ScoringFile("PGS000001", target_build=GenomeBuild.GRCh38)
   ...     x.download(directory=d)
   ...     ScoringFiles(x.local_path) # doctest: +ELLIPSIS
   ScoringFiles('.../PGS000001_hmPOS_GRCh38.txt.gz', target_build=None)

   But the ``target_build`` parameter doesn't work with local files:

   >>> with tempfile.TemporaryDirectory() as d:
   ...     x = ScoringFile("PGS000002", target_build=GenomeBuild.GRCh38)
   ...     x.download(directory=d)
   ...     ScoringFiles(x.local_path, target_build=GenomeBuild.GRCh37)
   Traceback (most recent call last):
   ...
   ValueError: Can't load local scoring file when target_build is setTry .normalise() method to do liftover, or load harmonised scoring files from PGS Catalog

   If you have a local scoring file that needs to change genome build, and using PGS
   Catalog harmonised data isn't an option, you should make a :class:`ScoringFile` from a path, then
   use the ``normalise()`` method with liftover enabled.


   .. py:property:: elements

      Returns a list of the :class:`ScoringFile` objects contained in this :class:`ScoringFiles`.


   .. py:attribute:: target_build
      :value: None
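
The deduplication and concatenation behaviour of ``ScoringFiles`` described above can be pictured with a small stand-alone container. This is a sketch under assumptions, not the library's implementation: ``UniqueBag`` and its ``build`` attribute are hypothetical stand-ins, and returning ``NotImplemented`` from ``__add__`` is one common way to produce the kind of ``TypeError`` shown in the concatenation example:

```python
class UniqueBag:
    """Hypothetical container: unique items in insertion order, plus a build label."""

    def __init__(self, *items, build=None):
        # dict.fromkeys drops duplicates while keeping first-seen order.
        self.items = list(dict.fromkeys(items))
        self.build = build

    def __add__(self, other):
        # Returning NotImplemented (rather than raising) lets Python raise its
        # standard "unsupported operand type(s)" TypeError.
        if not isinstance(other, UniqueBag) or self.build != other.build:
            return NotImplemented
        return UniqueBag(*self.items, *other.items, build=self.build)


merged = UniqueBag("PGS000001", "PGS000002") + UniqueBag("PGS000002", "PGS000003")
print(merged.items)

try:
    UniqueBag("PGS000001") + UniqueBag("PGS000002", build="GRCh38")
except TypeError as err:
    message = str(err)
    print(message)
```

Because ``UniqueBag`` defines no ``__mul__``, multiplying an instance by an integer fails with a similar ``TypeError``.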


.. py:data:: logger