calc.lib.legacy.polygenicscore
==============================

.. py:module:: calc.lib.legacy.polygenicscore


Attributes
----------

.. autoapisummary::

   calc.lib.legacy.polygenicscore.logger


Classes
-------

.. autoapisummary::

   calc.lib.legacy.polygenicscore.AdjustArguments
   calc.lib.legacy.polygenicscore.AdjustResults
   calc.lib.legacy.polygenicscore.AggregatedPGS
   calc.lib.legacy.polygenicscore.PolygenicScore


Module Contents
---------------

.. py:class:: AdjustArguments

   Arguments that control genetic similarity estimation and PGS adjustment

   >>> AdjustArguments(method_compare="Mahalanobis", pThreshold=None, method_normalization=("empirical", "mean"))
   AdjustArguments(method_compare='Mahalanobis', pThreshold=None, method_normalization=('empirical', 'mean'))


   .. py:attribute:: method_compare
      :type:  str
      :value: 'RandomForest'


   .. py:attribute:: method_normalization
      :type:  tuple[str, ...]
      :value: ('empirical', 'mean', 'mean+var')


   .. py:attribute:: pThreshold
      :type:  float | None
      :value: None


.. py:class:: AdjustResults

   Results returned by :meth:`AggregatedPGS.adjust`


   .. py:method:: write(directory)

      Write model, PGS, and PCA data to a directory


   .. py:attribute:: model_meta
      :type:  dict


   .. py:attribute:: models
      :type:  pandas.DataFrame


   .. py:attribute:: pca
      :type:  pandas.DataFrame


   .. py:attribute:: pgs
      :type:  pandas.DataFrame


   .. py:attribute:: scorecols
      :type:  list[str]


   .. py:attribute:: target_label
      :type:  str


.. py:class:: AggregatedPGS(*, target_name, df=None, path=None)

   A PGS that has been aggregated and melted, and that usually contains samples from both a reference panel and a target population.

   The most useful method in this class adjusts PGS based on :meth:`genetic ancestry similarity estimation <pgscatalog.calc.AggregatedPGS.adjust>`.

   >>> from ._config import Config
   >>> score_path = Config.ROOT_DIR / "tests" / "legacy" / "data" / "aggregated_scores.txt.gz"
   >>> AggregatedPGS(path=score_path, target_name="hgdp")
   AggregatedPGS(path=PosixPath('.../aggregated_scores.txt.gz'))


   .. py:method:: adjust(*, ref_pc, target_pc, adjust_arguments=None)

      Adjust a PGS based on genetic ancestry similarity estimations.

      :returns: :class:`AdjustResults`

      >>> from ._config import Config
      >>> from .principalcomponents import PrincipalComponents
      >>> related_path = Config.ROOT_DIR / "tests" / "legacy" / "data" / "ref.king.cutoff.id"
      >>> ref_pc = PrincipalComponents(
      ...     pcs_path=[Config.ROOT_DIR / "tests" / "legacy" / "data" / "ref.pcs"],
      ...     dataset="reference",
      ...     psam_path=Config.ROOT_DIR / "tests" / "legacy" / "data" / "ref.psam",
      ...     pop_type=PopulationType.REFERENCE,
      ...     related_path=related_path,
      ... )
      >>> target_pcs = PrincipalComponents(
      ...     pcs_path=Config.ROOT_DIR / "tests" / "legacy" / "data" / "target.pcs",
      ...     dataset="target",
      ...     pop_type=PopulationType.TARGET,
      ... )
      >>> score_path = Config.ROOT_DIR / "tests" / "legacy" / "data" / "aggregated_scores.txt"
      >>> results = AggregatedPGS(path=score_path, target_name="hgdp").adjust(ref_pc=ref_pc, target_pc=target_pcs)
      >>> results.pgs.to_dict().keys()
      dict_keys(['SUM|PGS001229_hmPOS_GRCh38', 'percentile_MostSimilarPop|PGS001229_hmPOS_GRCh38', 'Z_MostSimilarPop|PGS001229_hmPOS_GRCh38', ...

      >>> results.models
      {'dist_empirical': {'PGS001229_hmPOS_GRCh38': {'EUR': {'percentiles': array([-1.04069000e+01, -7.94665080e+00, ...

      Write the adjusted results to a directory:

      >>> import tempfile, os
      >>> dout = tempfile.mkdtemp()
      >>> results.write(directory=dout)
      >>> sorted(os.listdir(dout))
      ['target_info.json.gz', 'target_pgs.txt.gz', 'target_popsimilarity.txt.gz']
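
      The ``Z_MostSimilarPop`` and ``percentile_MostSimilarPop`` columns can be read as a score placed on a reference distribution. The sketch below is an illustration under stated assumptions, not the implementation used by :meth:`adjust`: it assumes the target sample has already been assigned to its most similar reference population, and the names ``ref_scores`` and ``target_score`` and all values are made up.

      .. code-block:: python

         import numpy as np

         # Made-up PGS sums for reference samples that share one population label
         ref_scores = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
         # Made-up PGS sum for one target sample assigned to that population
         target_score = 2.0

         # Z-score of the target relative to the reference mean and standard deviation
         z_score = (target_score - ref_scores.mean()) / ref_scores.std(ddof=0)

         # Empirical percentile: share of reference scores at or below the target
         percentile = 100.0 * np.mean(ref_scores <= target_score)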


   .. py:property:: df


   .. py:property:: path


   .. py:property:: target_name


.. py:class:: PolygenicScore(*, path=None, df=None, sampleset=None)

   Represents the output of ``plink2 --score`` written to a file

   >>> from ._config import Config
   >>> import reprlib
   >>> score1 = Config.ROOT_DIR / "tests" / "legacy" / "data" / "cineca_22_additive_0.sscore.zst"
   >>> pgs1 = PolygenicScore(sampleset="test", path=score1)  # doctest: +ELLIPSIS
   >>> pgs1
   PolygenicScore(sampleset='test', path=PosixPath('.../cineca_22_additive_0.sscore.zst'))
   >>> pgs2 = PolygenicScore(sampleset="test", path=score1)
   >>> reprlib.repr(pgs1.read().to_dict()) # doctest: +ELLIPSIS
   "{'DENOM': {('test', 'HG00096', 'HG00096'): 1564, ... 'PGS001229_22_SUM': {('test', 'HG00096', 'HG00096'): 0.54502, ...

   It's often helpful to combine PGS that were split per chromosome or by effect type:

   >>> aggregated_score = pgs1 + pgs2
   >>> aggregated_score  # doctest: +ELLIPSIS
   PolygenicScore(sampleset='test', path='(in-memory)')
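
   The effect of combining scores can be pictured with plain pandas. This is a minimal sketch rather than the library's implementation: it assumes that addition sums the ``SUM`` and ``DENOM`` columns of matching samples, which agrees with the doubled values in the aggregated example that follows. The ``chrom1`` and ``chrom2`` frames and their values are hypothetical.

   .. code-block:: python

      import pandas as pd

      # Hypothetical per-chromosome scores indexed by (sampleset, FID, IID)
      index = pd.MultiIndex.from_tuples(
          [("test", "HG00096", "HG00096"), ("test", "HG00097", "HG00097")],
          names=["sampleset", "FID", "IID"],
      )
      chrom1 = pd.DataFrame({"DENOM": [10, 10], "SUM": [0.5, 1.0]}, index=index)
      chrom2 = pd.DataFrame({"DENOM": [20, 20], "SUM": [0.25, 0.5]}, index=index)

      # Element-wise addition aligned on the index; fill_value=0 keeps samples
      # that are present in only one of the two frames
      combined = chrom1.add(chrom2, fill_value=0)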

   Once a score has been fully aggregated it can be helpful to recalculate an average:

   >>> aggregated_score.average()
   >>> aggregated_score.df  # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
                                       PGS       SUM  DENOM       AVG
   sampleset FID     IID
   test      HG00096 HG00096  PGS001229_22  1.090040   3128  0.000348
             HG00097 HG00097  PGS001229_22  1.348802   3128  0.000431
   ...

   Scores can be written to a TSV file:

   >>> import tempfile, os
   >>> outd = tempfile.mkdtemp()
   >>> aggregated_score.write(str(outd))
   >>> os.listdir(outd)
   ['aggregated_scores.txt.gz']

   Output files can also be split by sampleset:

   >>> splitoutd = tempfile.mkdtemp()
   >>> aggregated_score.write(splitoutd, split=True)
   >>> sorted(os.listdir(splitoutd), key = lambda x: x.split("_")[0])
   ['test_pgs.txt.gz']

   If a sampleset can't be inferred from the arguments or the path, a ``TypeError`` is raised:

   >>> PolygenicScore()
   Traceback (most recent call last):
   ...
   TypeError: Missing sampleset


   .. py:method:: average()

      Update the dataframe with a recalculated average.
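
      The values in the class example above are consistent with the average being ``SUM`` divided by ``DENOM``. The sketch below reproduces that arithmetic with plain pandas. It assumes this is the calculation, which the source does not state explicitly, and it is not the method's actual code.

      .. code-block:: python

         import pandas as pd

         # SUM and DENOM values copied from the aggregated example in the class docs
         df = pd.DataFrame(
             {
                 "PGS": ["PGS001229_22", "PGS001229_22"],
                 "SUM": [1.090040, 1.348802],
                 "DENOM": [3128, 3128],
             }
         )

         # Assumed definition of the average: aggregated sum over aggregated denominator
         df["AVG"] = df["SUM"] / df["DENOM"]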


   .. py:method:: melt()

      Update the dataframe with a melted version (wide format to long format)
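
      As a rough picture of what a wide-to-long reshape looks like, the sketch below uses ``pandas.DataFrame.melt`` on a small made-up frame. It is not the method's implementation: the second score column (``PGS000002_22_SUM``), the values, and the choice to strip the ``_SUM`` suffix are all assumptions made for illustration.

      .. code-block:: python

         import pandas as pd

         index = pd.MultiIndex.from_tuples(
             [("test", "HG00096", "HG00096"), ("test", "HG00097", "HG00097")],
             names=["sampleset", "FID", "IID"],
         )
         # Wide format: one column per score (PGS000002_22_SUM is hypothetical)
         wide = pd.DataFrame(
             {
                 "DENOM": [1564, 1564],
                 "PGS001229_22_SUM": [0.545020, 0.674401],
                 "PGS000002_22_SUM": [0.1, 0.2],
             },
             index=index,
         )

         # Long format: one row per (sample, score) pair
         long_df = wide.reset_index().melt(
             id_vars=["sampleset", "FID", "IID", "DENOM"],
             var_name="PGS",
             value_name="SUM",
         )
         # Drop the trailing "_SUM" so the PGS column holds only the score name
         long_df["PGS"] = long_df["PGS"].str.replace("_SUM$", "", regex=True)
         long_df = long_df.set_index(["sampleset", "FID", "IID"])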


   .. py:method:: read()

      Eagerly load a PGS into a pandas dataframe

      The FID column may be missing from the input data:

      >>> from ._config import Config
      >>> from xopen import xopen
      >>> score1 = Config.ROOT_DIR / "tests" / "legacy" / "data" / "cineca_22_additive_0.sscore.zst"
      >>> with xopen(score1) as f:
      ...     f.readline().split()
      ['#IID', 'ALLELE_CT', 'DENOM', 'NAMED_ALLELE_DOSAGE_SUM', 'PGS001229_22_AVG', 'PGS001229_22_SUM']

      When the FID column is missing, FID is set to IID:

      >>> PolygenicScore(sampleset="test", path=score1).read()  # doctest: +ELLIPSIS,+NORMALIZE_WHITESPACE
                                  DENOM  PGS001229_22_SUM
      sampleset FID     IID
      test      HG00096 HG00096   1564          0.545020
      ...
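
      The fallback can be sketched with plain pandas. This is an illustration only, not the code behind :meth:`read`: the in-memory text stands in for a score file with the header layout shown above, and the renaming of ``#IID`` is an assumption about how the header is handled.

      .. code-block:: python

         import io

         import pandas as pd

         # Stand-in for a score file that has no FID column
         text = "#IID\tDENOM\tPGS001229_22_SUM\nHG00096\t1564\t0.545020\n"
         df = pd.read_csv(io.StringIO(text), sep="\t").rename(columns={"#IID": "IID"})

         # Fall back to the individual ID when no family ID is present
         if "FID" not in df.columns:
             df["FID"] = df["IID"]

         # Label rows with the sampleset and build the three-level index
         df = df.assign(sampleset="test").set_index(["sampleset", "FID", "IID"])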


   .. py:method:: write(outdir, split=False)

      Write PGS to a compressed TSV
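
      The two output layouts shown in the class examples can be approximated with ``pandas.DataFrame.to_csv``. The sketch below is not the method's implementation. The file names are taken from the examples above, the data is made up, and both layouts are written into one temporary directory only to keep the example short.

      .. code-block:: python

         import pathlib
         import tempfile

         import pandas as pd

         index = pd.MultiIndex.from_tuples(
             [("test", "HG00096", "HG00096"), ("test", "HG00097", "HG00097")],
             names=["sampleset", "FID", "IID"],
         )
         df = pd.DataFrame({"DENOM": [3128, 3128], "SUM": [1.090040, 1.348802]}, index=index)
         outdir = pathlib.Path(tempfile.mkdtemp())

         # Single gzip-compressed TSV containing every sampleset
         df.to_csv(outdir / "aggregated_scores.txt.gz", sep="\t", compression="gzip")

         # One gzip-compressed TSV per sampleset
         for sampleset, group in df.groupby(level="sampleset"):
             group.to_csv(outdir / f"{sampleset}_pgs.txt.gz", sep="\t", compression="gzip")

         written = sorted(p.name for p in outdir.iterdir())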


   .. py:property:: df


   .. py:property:: path


.. py:data:: logger