match.lib.matchresult
=====================

.. py:module:: match.lib.matchresult


Attributes
----------

.. autoapisummary::

   match.lib.matchresult.logger


Classes
-------

.. autoapisummary::

   match.lib.matchresult.MatchResult
   match.lib.matchresult.MatchResults


Module Contents
---------------

.. py:class:: MatchResult(dataset, matchresult=None, ipc_path=None)

   Represents variants in a scoring file matched against variants in a target genome

   When matching a scoring file, matches are often composed of many
   :class:`MatchResult` objects. This is common if the target genome is split to
   have one chromosome per file. The container class
   :class:`MatchResults` provides helper methods for working with split data.

   >>> from ._config import Config
   >>> from .variantframe import VariantFrame
   >>> from .scoringfileframe import ScoringFileFrame, match_variants
   >>> target_path = Config.ROOT_DIR.parent / "pgscatalog.core" / "tests" / "data" / "hapnest.bim"
   >>> target = VariantFrame(target_path, dataset="hapnest")
   >>> score_path = Config.ROOT_DIR.parent / "pgscatalog.core" / "tests" / "data" / "combined.txt.gz"
   >>> scorefile = ScoringFileFrame(score_path)

   A :class:`MatchResult` can be instantiated with the LazyFrame output of the ``match_variants`` function:

   >>> with target as target_df, scorefile as score_df:
   ...     match_variants(score_df=score_df, target_df=target_df, target=target)  # doctest: +ELLIPSIS
   MatchResult(dataset=hapnest, matchresult=[<LazyFrame...], ipc_path=None, df=None)

   A :class:`MatchResult` can also be saved to and loaded from Arrow IPC files:

   >>> import tempfile
   >>> fout = tempfile.NamedTemporaryFile(delete=False)
   >>> with target as target_df, scorefile as score_df:
   ...     results = match_variants(score_df=score_df, target_df=target_df, target=target)
   ...     _ = results.collect(outfile=fout.name)
   >>> x = MatchResult.from_ipc(fout.name, dataset="hapnest")
   >>> x  # doctest: +ELLIPSIS
   MatchResult(dataset=hapnest, matchresult=None, ipc_path=..., df=<LazyFrame...>)


   .. py:method:: collect(outfile=None)

      Compute match results and optionally save to file


   .. py:method:: from_ipc(matchresults_ipc_path, dataset)
      :classmethod:


      Create an instance from an Arrow IPC file


   .. py:attribute:: dataset


   .. py:attribute:: df
      :value: None


   .. py:attribute:: ipc_path
      :value: None


.. py:class:: MatchResults(*elements)


   Container for :class:`MatchResult`

   Useful for making matching logs and writing scoring files ready to be used by ``plink2 --score``

   >>> import tempfile, os, glob, pathlib
   >>> from ._config import Config
   >>> from .variantframe import VariantFrame
   >>> from .scoringfileframe import ScoringFileFrame, match_variants
   >>> fout = tempfile.NamedTemporaryFile(delete=False)
   >>> target_path = Config.ROOT_DIR / "tests" / "data" / "good_match.pvar"
   >>> score_path =  Config.ROOT_DIR / "tests" / "data" / "good_match_scorefile.txt"
   >>> target = VariantFrame(target_path, dataset="goodmatch")
   >>> scorefile = ScoringFileFrame(score_path)
   >>> foutdir, splitfoutdir = tempfile.mkdtemp(), tempfile.mkdtemp()

   Always use a context manager to prepare the :class:`ScoringFileFrame` and :class:`VariantFrame` data frames:

   >>> with target as target_df, scorefile as score_df:
   ...     results = match_variants(score_df=score_df, target_df=target_df, target=target)
   ...     _ = results.collect(outfile=fout.name)

   These data frames are transparently backed by Arrow IPC files on disk.

   >>> with scorefile as score_df:
   ...     x = MatchResult.from_ipc(fout.name, dataset="goodmatch")
   ...     _ = MatchResults(x).write_scorefiles(directory=foutdir, score_df=score_df)
   ...     _ = MatchResults(x).write_scorefiles(directory=splitfoutdir, split=True, score_df=score_df)
   >>> MatchResults(x)  # doctest: +ELLIPSIS
   MatchResults([MatchResult(dataset=goodmatch, matchresult=None, ipc_path=...])

   By default, scoring files are written with multiple chromosomes per file:

   >>> combined_paths = sorted(glob.glob(foutdir + "/*ALL*"), key=lambda x: pathlib.Path(x).stem)
   >>> combined_paths # doctest: +ELLIPSIS
   ['.../goodmatch_ALL_additive_0.scorefile.gz', '.../goodmatch_ALL_dominant_0.scorefile.gz', '.../goodmatch_ALL_recessive_0.scorefile.gz']
   >>> assert len(combined_paths) == 3

   Scoring files can also be split by chromosome. The input scoring file contains 20 unique
   chromosomes, plus one dominant and one recessive effect file. One chromosome didn't match
   well, so only 19 additive files are written:

   >>> scorefiles = sorted(os.listdir(splitfoutdir))
   >>> scorefiles # doctest: +ELLIPSIS
   ['goodmatch_10_additive_0.scorefile.gz', 'goodmatch_11_additive_0.scorefile.gz', ...]
   >>> sum("dominant" in f for f in scorefiles)
   1
   >>> sum("recessive" in f for f in scorefiles)
   1
   >>> sum("additive" in f for f in scorefiles)
   19
   >>> assert len(scorefiles) == 21

   An important part of matching variants is producing a log that shows how well a PGS is reproduced in the new target genomes:

   >>> import polars as pl
   >>> with pl.Config(tbl_formatting="ASCII_MARKDOWN", tbl_hide_column_data_types=True, tbl_width_chars=120), scorefile as score_df:
   ...     MatchResults(x).full_variant_log(score_df).fetch()  # doctest: +ELLIPSIS
   shape: (169, 23)
   | row_nr | accession | chr_name | chr_position | … | duplicate_ID | match_IDs | match_status | dataset   |
   |--------|-----------|----------|--------------|---|--------------|-----------|--------------|-----------|
   | 0      | PGS000002 | 11       | 69331418     | … | true         | NA        | excluded     | goodmatch |
   | 1      | PGS000002 | 11       | 69379161     | … | false        | NA        | matched      | goodmatch |
   | 2      | PGS000002 | 11       | 69331642     | … | false        | NA        | excluded     | goodmatch |
   | 2      | PGS000002 | 11       | 69331642     | … | false        | NA        | not_best     | goodmatch |
   | 3      | PGS000002 | 5        | 1282319      | … | false        | NA        | matched      | goodmatch |
   | …      | …         | …        | …            | … | …            | …         | …            | …         |
   | 73     | PGS000001 | 1        | 204518842    | … | false        | NA        | matched      | goodmatch |
   | 74     | PGS000001 | 1        | 202187176    | … | false        | NA        | matched      | goodmatch |
   | 75     | PGS000001 | 2        | 19320803     | … | false        | NA        | matched      | goodmatch |
   | 76     | PGS000001 | 16       | 53855291     | … | false        | NA        | excluded     | goodmatch |
   | 76     | PGS000001 | 16       | 53855291     | … | false        | NA        | not_best     | goodmatch |


   .. py:method:: filter(score_df, min_overlap=0.75, **kwargs)

      Filter match candidates after labelling according to user parameters
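
      As a rough illustration of how a ``min_overlap`` threshold could be applied, the
      sketch below assumes that ``min_overlap`` is the minimum fraction of scoring file
      variants that must match for a score to pass. That interpretation and the
      hypothetical helper ``passes_min_overlap`` are assumptions made for illustration,
      not the library's implementation:

      .. code-block:: python

         def passes_min_overlap(n_matched, n_total, min_overlap=0.75):
             """Return True when the matched fraction meets the threshold.

             Hypothetical helper: assumes min_overlap is a fraction in [0, 1].
             """
             if n_total <= 0:
                 raise ValueError("n_total must be a positive number of variants")
             # match rate = matched variants / variants in the scoring file
             return n_matched / n_total >= min_overlap


         # 80 of 100 variants matched: the score passes the default threshold
         high_rate = passes_min_overlap(80, 100)
         # 70 of 100 variants matched: the score falls below the default threshold
         low_rate = passes_min_overlap(70, 100)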


   .. py:method:: full_variant_log(score_df, **kwargs)

      Generate a log for each variant in a scoring file

      Multiple match candidates may exist for each variant in the original file.
      Describe each variant (one variant per row) with match metadata


   .. py:method:: label(keep_first_match=False, remove_ambiguous=True, skip_flip=False, remove_multiallelic=True, filter_IDs=None)

      Label match candidates according to matching parameters

      Keyword arguments control labelling parameters:

      * ``keep_first_match``: if best match candidates are tied, keep the first? (default: ``False``, drop all candidates for this variant)
      * ``remove_ambiguous``: remove ambiguous alleles? (default: ``True``)
      * ``skip_flip``: skip matched variants that may be reported on the opposite strand? (default: ``False``, these variants are considered)
      * ``remove_multiallelic``: remove multiallelic variants before matching? (default: ``True``)
      * ``filter_IDs``: constrain variants to this list of IDs (default: ``None``, don't constrain)
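
      To make ``remove_ambiguous`` more concrete, the sketch below assumes that
      "ambiguous" refers to strand-ambiguous (palindromic) allele pairs such as A/T
      or C/G, where flipping strands yields the same pair. The helper ``is_ambiguous``
      is hypothetical and only illustrates the idea; it is not the library's code:

      .. code-block:: python

         # complement of each base on the opposite strand
         COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


         def is_ambiguous(effect_allele, other_allele):
             """Return True if the two alleles are complements of each other.

             Hypothetical helper: only single-base alleles can be ambiguous here,
             because longer alleles are not keys of COMPLEMENT.
             """
             return COMPLEMENT.get(effect_allele.upper()) == other_allele.upper()


         pairs = [("A", "T"), ("C", "G"), ("A", "G")]
         # keep only the pairs that are not strand-ambiguous
         unambiguous = [pair for pair in pairs if not is_ambiguous(*pair)]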


   .. py:method:: write_scorefiles(directory, score_df, split=False, min_overlap=0.75, **kwargs)

      Write matches to a set of files ready for ``plink2 --score``

      This method:

      * Labels match candidates
      * Filters match candidates based on labels and user configuration
      * Calculates match rates to see how well the PGS reproduces in the new target genomes
      * Generates a filtered variant log containing the best match candidate
      * Checks if the number of variants in the summary log matches the input scoring file
      * Sets up parallel score calculation (pivots data to wide column format)
      * Writes scores to a directory, splitting based on chromosome and effect type
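
      The last step can be pictured with a small sketch that groups variants into
      output files by chromosome and effect type. The helper ``plan_scorefiles`` and
      the ``effect_type`` key are hypothetical. The file name pattern imitates the
      names shown in the :class:`MatchResults` examples, and the trailing ``_0`` is
      kept as a fixed suffix because its meaning is not documented here:

      .. code-block:: python

         from collections import defaultdict


         def plan_scorefiles(variants, dataset, split=False):
             """Group variant rows by output file name (hypothetical helper)."""
             groups = defaultdict(list)
             for variant in variants:
                 # without splitting, all chromosomes share a combined "ALL" file
                 chrom = variant["chr_name"] if split else "ALL"
                 # "_0" is an assumed fixed suffix copied from the example names
                 name = f"{dataset}_{chrom}_{variant['effect_type']}_0.scorefile.gz"
                 groups[name].append(variant)
             return dict(groups)


         variants = [
             {"chr_name": "1", "effect_type": "additive"},
             {"chr_name": "2", "effect_type": "additive"},
             {"chr_name": "2", "effect_type": "dominant"},
         ]
         combined = plan_scorefiles(variants, dataset="example")
         per_chrom = plan_scorefiles(variants, dataset="example", split=True)

      In this sketch ``combined`` holds one entry per effect type, while
      ``per_chrom`` holds one entry per chromosome and effect type.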


   .. py:attribute:: dataset


   .. py:property:: df
      :type: polars.LazyFrame


      A data frame containing raw match results


   .. py:property:: filter_summary
      :type: polars.DataFrame


      A log that summarises the impact of filtering


   .. py:property:: filtered_matches
      :type: polars.LazyFrame


      A data frame containing up to one row per variant (the best possible match)


   .. py:property:: match_candidates
      :type: polars.LazyFrame


      A data frame containing all possible matches for each input score variant


   .. py:property:: summary_log
      :type: polars.DataFrame


      A summary log containing match rates for variants


.. py:data:: logger