match.lib.matchresult
=====================

.. py:module:: match.lib.matchresult

Attributes
----------

.. autoapisummary::

   match.lib.matchresult.logger

Classes
-------

.. autoapisummary::

   match.lib.matchresult.MatchResult
   match.lib.matchresult.MatchResults

Module Contents
---------------

.. py:class:: MatchResult(dataset, matchresult=None, ipc_path=None)

   Represents variants in a scoring file matched against variants in a target genome

   When matching a scoring file, it's normal for matches to be composed of many :class:`MatchResult` objects. This is common if the target genome is split to have one chromosome per scoring file, and the container class :class:`MatchResults` provides some helpful methods for working with split data.

   >>> from ._config import Config
   >>> from .variantframe import VariantFrame
   >>> from .scoringfileframe import ScoringFileFrame, match_variants
   >>> target_path = Config.ROOT_DIR.parent / "pgscatalog.core" / "tests" / "data" / "hapnest.bim"
   >>> target = VariantFrame(target_path, dataset="hapnest")
   >>> score_path = Config.ROOT_DIR.parent / "pgscatalog.core" / "tests" / "data" / "combined.txt.gz"
   >>> scorefile = ScoringFileFrame(score_path)

   A :class:`MatchResult` can be instantiated with the lazyframe output of the ``match_variants`` function:

   >>> with target as target_df, scorefile as score_df:
   ...     match_variants(score_df=score_df, target_df=target_df, target=target)  # doctest: +ELLIPSIS
   MatchResult(dataset=hapnest, matchresult=[...

   >>> import tempfile
   >>> fout = tempfile.NamedTemporaryFile(delete=False)
   >>> with target as target_df, scorefile as score_df:
   ...     results = match_variants(score_df=score_df, target_df=target_df, target=target)
   ...     _ = results.collect(outfile=fout.name)
   >>> x = MatchResult.from_ipc(fout.name, dataset="hapnest")
   >>> x  # doctest: +ELLIPSIS
   MatchResult(dataset=hapnest, matchresult=None, ipc_path=..., df=...)

   .. py:method:: collect(outfile=None)

      Compute match results and optionally save to file

   .. 
py:method:: from_ipc(matchresults_ipc_path, dataset)
      :classmethod:

      Create an instance from an Arrow IPC file

   .. py:attribute:: dataset

   .. py:attribute:: df
      :value: None

   .. py:attribute:: ipc_path
      :value: None

.. py:class:: MatchResults(*elements)

   Container for :class:`MatchResult`

   Useful for making matching logs and writing scoring files ready to be used by ``plink2 --score``

   >>> import tempfile, os, glob, pathlib
   >>> from ._config import Config
   >>> from .variantframe import VariantFrame
   >>> from .scoringfileframe import ScoringFileFrame, match_variants
   >>> fout = tempfile.NamedTemporaryFile(delete=False)
   >>> target_path = Config.ROOT_DIR / "tests" / "data" / "good_match.pvar"
   >>> score_path = Config.ROOT_DIR / "tests" / "data" / "good_match_scorefile.txt"
   >>> target = VariantFrame(target_path, dataset="goodmatch")
   >>> scorefile = ScoringFileFrame(score_path)
   >>> foutdir, splitfoutdir = tempfile.mkdtemp(), tempfile.mkdtemp()

   Using a context manager is really important to prepare :class:`ScoringFileFrame` and :class:`VariantFrame` data frames:

   >>> with target as target_df, scorefile as score_df:
   ...     results = match_variants(score_df=score_df, target_df=target_df, target=target)
   ...     _ = results.collect(outfile=fout.name)

   These data frames are transparently backed by Arrow IPC files on disk.

   >>> with scorefile as score_df:
   ...     x = MatchResult.from_ipc(fout.name, dataset="goodmatch")
   ...     _ = MatchResults(x).write_scorefiles(directory=foutdir, score_df=score_df)
   ...     
_ = MatchResults(x).write_scorefiles(directory=splitfoutdir, split=True, score_df=score_df)
   >>> MatchResults(x)  # doctest: +ELLIPSIS
   MatchResults([MatchResult(dataset=goodmatch, matchresult=None, ipc_path=...])

   By default, scoring files are written with multiple chromosomes per file:

   >>> combined_paths = sorted(glob.glob(foutdir + "/*ALL*"), key=lambda x: pathlib.Path(x).stem)
   >>> combined_paths  # doctest: +ELLIPSIS
   ['.../goodmatch_ALL_additive_0.scorefile.gz', '.../goodmatch_ALL_dominant_0.scorefile.gz', '.../goodmatch_ALL_recessive_0.scorefile.gz']
   >>> assert len(combined_paths) == 3

   Scoring files can also be split with ``split=True``. The input scoring file contains 20 unique chromosomes and produces one dominant and one recessive effect file (one chromosome didn't match well, so only 19 additive files are written):

   >>> scorefiles = sorted(os.listdir(splitfoutdir))
   >>> scorefiles  # doctest: +ELLIPSIS
   ['goodmatch_10_additive_0.scorefile.gz', 'goodmatch_11_additive_0.scorefile.gz', ...]
   >>> sum("dominant" in f for f in scorefiles)
   1
   >>> sum("recessive" in f for f in scorefiles)
   1
   >>> sum("additive" in f for f in scorefiles)
   19
   >>> assert len(scorefiles) == 21

   An important part of matching variants is reporting a log to see how well you're reproducing a PGS in the new target genomes:

   >>> with pl.Config(tbl_formatting="ASCII_MARKDOWN", tbl_hide_column_data_types=True, tbl_width_chars=120), scorefile as score_df:
   ...     
MatchResults(x).full_variant_log(score_df).fetch()  # doctest: +ELLIPSIS
   shape: (169, 23)
   | row_nr | accession | chr_name | chr_position | … | duplicate_ID | match_IDs | match_status | dataset   |
   |--------|-----------|----------|--------------|---|--------------|-----------|--------------|-----------|
   | 0      | PGS000002 | 11       | 69331418     | … | true         | NA        | excluded     | goodmatch |
   | 1      | PGS000002 | 11       | 69379161     | … | false        | NA        | matched      | goodmatch |
   | 2      | PGS000002 | 11       | 69331642     | … | false        | NA        | excluded     | goodmatch |
   | 2      | PGS000002 | 11       | 69331642     | … | false        | NA        | not_best     | goodmatch |
   | 3      | PGS000002 | 5        | 1282319      | … | false        | NA        | matched      | goodmatch |
   | …      | …         | …        | …            | … | …            | …         | …            | …         |
   | 73     | PGS000001 | 1        | 204518842    | … | false        | NA        | matched      | goodmatch |
   | 74     | PGS000001 | 1        | 202187176    | … | false        | NA        | matched      | goodmatch |
   | 75     | PGS000001 | 2        | 19320803     | … | false        | NA        | matched      | goodmatch |
   | 76     | PGS000001 | 16       | 53855291     | … | false        | NA        | excluded     | goodmatch |
   | 76     | PGS000001 | 16       | 53855291     | … | false        | NA        | not_best     | goodmatch |

   .. py:method:: filter(score_df, min_overlap=0.75, **kwargs)

      Filter match candidates after labelling according to user parameters

   .. py:method:: full_variant_log(score_df, **kwargs)

      Generate a log for each variant in a scoring file

      Multiple match candidates may exist for each variant in the original file.
      Describe each variant (one variant per row) with match metadata

   .. py:method:: label(keep_first_match=False, remove_ambiguous=True, skip_flip=False, remove_multiallelic=True, filter_IDs=None)

      Label match candidates according to matching parameters

      kwargs control labelling parameters:

      * ``keep_first_match``: if best match candidates are tied, keep the first? (default: ``False``, drop all candidates for this variant)
      * ``remove_ambiguous``: Remove ambiguous alleles? 
        (default: ``True``)
      * ``skip_flip``: ignore matched variants that may be reported on the opposite strand? (default: ``False``, so these matches are considered)
      * ``remove_multiallelic``: remove multiallelic variants before matching (default: ``True``)
      * ``filter_IDs``: constrain variants to this list of IDs (default: don't constrain)

   .. py:method:: write_scorefiles(directory, score_df, split=False, min_overlap=0.75, **kwargs)

      Write matches to a set of files ready for ``plink2 --score``

      This method performs several steps:

      * Labels match candidates
      * Filters match candidates based on labels and user configuration
      * Calculates match rates to see how well the PGS reproduces in the new target genomes
      * Generates a filtered variant log containing the best match candidate
      * Checks if the number of variants in the summary log matches the input scoring file
      * Sets up parallel score calculation (pivots data to wide column format)
      * Writes scores to a directory, splitting based on chromosome and effect type

   .. py:attribute:: dataset

   .. py:property:: df
      :type: polars.LazyFrame

      A df containing raw match results

   .. py:property:: filter_summary
      :type: polars.DataFrame

      A log that summarises the impact of filtering

   .. py:property:: filtered_matches
      :type: polars.LazyFrame

      A df containing up to one row per variant (the best possible match)

   .. py:property:: match_candidates
      :type: polars.LazyFrame

      A df containing all possible matches for each input score variant

   .. py:property:: summary_log
      :type: polars.DataFrame

      A summary log containing match rates for variants

.. py:data:: logger
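
The match-rate check that ``filter`` and ``write_scorefiles`` expose through ``min_overlap`` can be illustrated with a minimal plain-Python sketch. This is not the library's implementation: the ``match_rates`` and ``passing_scores`` helpers and the toy rows are hypothetical. It assumes one row per scoring-file variant, that only the ``matched`` status counts towards the rate, and that a score passes when its rate is greater than or equal to ``min_overlap``.

```python
from collections import Counter

# Hypothetical toy log: one row per scoring-file variant (not real match output)
variant_log = [
    {"accession": "PGS000001", "match_status": "matched"},
    {"accession": "PGS000001", "match_status": "matched"},
    {"accession": "PGS000001", "match_status": "matched"},
    {"accession": "PGS000001", "match_status": "excluded"},
    {"accession": "PGS000002", "match_status": "matched"},
    {"accession": "PGS000002", "match_status": "excluded"},
]


def match_rates(rows):
    """Return the fraction of variants labelled 'matched' for each accession."""
    totals = Counter(row["accession"] for row in rows)
    matched = Counter(
        row["accession"] for row in rows if row["match_status"] == "matched"
    )
    return {accession: matched[accession] / totals[accession] for accession in totals}


def passing_scores(rates, min_overlap=0.75):
    """Keep accessions whose match rate reaches the threshold (assumed inclusive)."""
    return sorted(acc for acc, rate in rates.items() if rate >= min_overlap)


rates = match_rates(variant_log)
print(rates)
print(passing_scores(rates))
```

In this toy data only ``PGS000001`` reaches the default threshold; lowering ``min_overlap`` to 0.5 would keep both scores.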