match.lib.matchresult

Attributes

logger

Classes

MatchResult

Represents variants in a scoring file matched against variants in a target genome

MatchResults

Container for MatchResult

Module Contents

class match.lib.matchresult.MatchResult(dataset, matchresult=None, ipc_path=None)

Represents variants in a scoring file matched against variants in a target genome

When matching a scoring file, matches are often composed of many MatchResult objects. This is common when the target genome is split by chromosome. The container class MatchResults provides helper methods for working with split data.

>>> from ._config import Config
>>> from .variantframe import VariantFrame
>>> from .scoringfileframe import ScoringFileFrame, match_variants
>>> target_path = Config.ROOT_DIR.parent / "pgscatalog.core" / "tests" / "data" / "hapnest.bim"
>>> target = VariantFrame(target_path, dataset="hapnest")
>>> score_path = Config.ROOT_DIR.parent / "pgscatalog.core" / "tests" / "data" / "combined.txt.gz"
>>> scorefile = ScoringFileFrame(score_path)

A MatchResult can be instantiated with the lazyframe output of the match_variants function:

>>> with target as target_df, scorefile as score_df:
...     match_variants(score_df=score_df, target_df=target_df, target=target)
MatchResult(dataset=hapnest, matchresult=[<LazyFrame...], ipc_path=None, df=None)

A MatchResult can also be saved to and loaded from Arrow IPC files:

>>> import tempfile
>>> fout = tempfile.NamedTemporaryFile(delete=False)
>>> with target as target_df, scorefile as score_df:
...     results = match_variants(score_df=score_df, target_df=target_df, target=target)
...     _ = results.collect(outfile=fout.name)
>>> x = MatchResult.from_ipc(fout.name, dataset="hapnest")
>>> x
MatchResult(dataset=hapnest, matchresult=None, ipc_path=..., df=<LazyFrame...>)

collect(outfile=None)

Compute match results and optionally save to file

classmethod from_ipc(matchresults_ipc_path, dataset)

Create an instance from an Arrow IPC file

dataset
df = None
ipc_path = None

class match.lib.matchresult.MatchResults(*elements)

Container for MatchResult

Useful for making matching logs and writing scoring files ready to be used by plink2 --score

>>> import tempfile, os, glob, pathlib
>>> from ._config import Config
>>> from .variantframe import VariantFrame
>>> from .scoringfileframe import ScoringFileFrame, match_variants
>>> fout = tempfile.NamedTemporaryFile(delete=False)
>>> target_path = Config.ROOT_DIR / "tests" / "data" / "good_match.pvar"
>>> score_path = Config.ROOT_DIR / "tests" / "data" / "good_match_scorefile.txt"
>>> target = VariantFrame(target_path, dataset="goodmatch")
>>> scorefile = ScoringFileFrame(score_path)
>>> foutdir, splitfoutdir = tempfile.mkdtemp(), tempfile.mkdtemp()

Use a context manager to prepare the ScoringFileFrame and VariantFrame data frames:

>>> with target as target_df, scorefile as score_df:
...     results = match_variants(score_df=score_df, target_df=target_df, target=target)
...     _ = results.collect(outfile=fout.name)

These data frames are transparently backed by Arrow IPC files on disk.

>>> with scorefile as score_df:
...     x = MatchResult.from_ipc(fout.name, dataset="goodmatch")
...     _ = MatchResults(x).write_scorefiles(directory=foutdir, score_df=score_df)
...     _ = MatchResults(x).write_scorefiles(directory=splitfoutdir, split=True, score_df=score_df)
>>> MatchResults(x)
MatchResults([MatchResult(dataset=goodmatch, matchresult=None, ipc_path=...])

By default, scoring files are written with multiple chromosomes per file:

>>> combined_paths = sorted(glob.glob(foutdir + "/*ALL*"), key=lambda x: pathlib.Path(x).stem)
>>> combined_paths
['.../goodmatch_ALL_additive_0.scorefile.gz', '.../goodmatch_ALL_dominant_0.scorefile.gz', '.../goodmatch_ALL_recessive_0.scorefile.gz']
>>> assert len(combined_paths) == 3

Scoring files can also be split by chromosome. The input scoring file contains 20 unique chromosomes, with one dominant and one recessive effect file (but one chromosome didn’t match well):

>>> scorefiles = sorted(os.listdir(splitfoutdir))
>>> scorefiles
['goodmatch_10_additive_0.scorefile.gz', 'goodmatch_11_additive_0.scorefile.gz', ...]
>>> sum("dominant" in f for f in scorefiles)
1
>>> sum("recessive" in f for f in scorefiles)
1
>>> sum("additive" in f for f in scorefiles)
19
>>> assert len(scorefiles) == 21

An important part of matching variants is reporting a log to see how well you’re reproducing a PGS in the new target genomes:

>>> with pl.Config(tbl_formatting="ASCII_MARKDOWN", tbl_hide_column_data_types=True, tbl_width_chars=120), scorefile as score_df:
...     MatchResults(x).full_variant_log(score_df).fetch()  # doctest: +ELLIPSIS
shape: (169, 23)
| row_nr | accession | chr_name | chr_position | … | duplicate_ID | match_IDs | match_status | dataset   |
|--------|-----------|----------|--------------|---|--------------|-----------|--------------|-----------|
| 0      | PGS000002 | 11       | 69331418     | … | true         | NA        | excluded     | goodmatch |
| 1      | PGS000002 | 11       | 69379161     | … | false        | NA        | matched      | goodmatch |
| 2      | PGS000002 | 11       | 69331642     | … | false        | NA        | excluded     | goodmatch |
| 2      | PGS000002 | 11       | 69331642     | … | false        | NA        | not_best     | goodmatch |
| 3      | PGS000002 | 5        | 1282319      | … | false        | NA        | matched      | goodmatch |
| …      | …         | …        | …            | … | …            | …         | …            | …         |
| 73     | PGS000001 | 1        | 204518842    | … | false        | NA        | matched      | goodmatch |
| 74     | PGS000001 | 1        | 202187176    | … | false        | NA        | matched      | goodmatch |
| 75     | PGS000001 | 2        | 19320803     | … | false        | NA        | matched      | goodmatch |
| 76     | PGS000001 | 16       | 53855291     | … | false        | NA        | excluded     | goodmatch |
| 76     | PGS000001 | 16       | 53855291     | … | false        | NA        | not_best     | goodmatch |

filter(score_df, min_overlap=0.75, **kwargs)

Filter match candidates after labelling according to user parameters
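The role of min_overlap can be illustrated with a minimal sketch. It assumes the match rate is the fraction of a score's variants that were matched, and that a score passes when this rate is at least min_overlap. Both the helper names and the >= comparison are assumptions made for illustration and are not taken from match.lib.matchresult.

```python
from collections import Counter


def match_rates(variants):
    """Return the fraction of matched variants for each score.

    variants: iterable of (accession, matched) pairs, where matched is a bool.
    This is a hypothetical helper, not part of match.lib.matchresult.
    """
    totals, hits = Counter(), Counter()
    for accession, matched in variants:
        totals[accession] += 1
        hits[accession] += int(matched)
    return {accession: hits[accession] / totals[accession] for accession in totals}


def passes_min_overlap(variants, min_overlap=0.75):
    """Flag each score as passing when its match rate reaches min_overlap.

    The >= comparison is an assumption; the library may treat the boundary differently.
    """
    return {acc: rate >= min_overlap for acc, rate in match_rates(variants).items()}


# Hypothetical input: score_a matches 3 of 4 variants, score_b matches 1 of 2
variants = (
    [("score_a", True)] * 3
    + [("score_a", False)]
    + [("score_b", True), ("score_b", False)]
)
print(match_rates(variants))
print(passes_min_overlap(variants))
```

Under these assumptions a score with a low match rate would be flagged before any scoring files are written for it.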

full_variant_log(score_df, **kwargs)

Generate a log for each variant in a scoring file

Multiple match candidates may exist for each variant in the original file. Describe each variant (one variant per row) with match metadata

label(keep_first_match=False, remove_ambiguous=True, skip_flip=False, remove_multiallelic=True, filter_IDs=None)

Label match candidates according to matching parameters

kwargs control labelling parameters:

  • keep_first_match: if best match candidates are tied, keep the first? (default: False, drop all candidates for this variant)

  • remove_ambiguous: Remove ambiguous alleles? (default: True)

  • skip_flip: skip match candidates that may be reported on the opposite strand? (default: False, consider them)

  • remove_multiallelic: remove multiallelic variants before matching? (default: True)

  • filter_IDs: constrain variants to this list of IDs (default: None, don’t constrain)
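Two of the labelling parameters above can be sketched in plain Python. The sketch assumes that "ambiguous" means a strand-ambiguous (palindromic) SNP such as A/T or C/G, and that each candidate carries a rank where a lower value means a better match. The record layout, the rank field, and the function names are hypothetical and are not the library's implementation.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(effect_allele, other_allele):
    """True for single-base allele pairs that are reverse complements (A/T, C/G)."""
    return effect_allele in COMPLEMENT and COMPLEMENT[effect_allele] == other_allele


def label_candidates(candidates, keep_first_match=False, remove_ambiguous=True):
    """Return copies of the candidate dicts with 'ambiguous' and 'exclude' flags.

    candidates: list of dicts with keys row_nr, effect_allele, other_allele, rank.
    This record layout is an assumption made for the sketch.
    """
    # Find the best (lowest) rank for each scoring file variant
    best_rank = {}
    for c in candidates:
        row = c["row_nr"]
        best_rank[row] = min(c["rank"], best_rank.get(row, c["rank"]))
    # Count how many candidates share the best rank (ties)
    ties = Counter(c["row_nr"] for c in candidates if c["rank"] == best_rank[c["row_nr"]])

    seen_best = set()
    labelled = []
    for c in candidates:
        row = c["row_nr"]
        is_best = c["rank"] == best_rank[row]
        tied = ties[row] > 1
        if not is_best:
            exclude = True  # a better candidate exists
        elif tied and not keep_first_match:
            exclude = True  # drop every tied candidate
        elif tied and row in seen_best:
            exclude = True  # keep only the first of the tied candidates
        else:
            exclude = False
        if is_best:
            seen_best.add(row)
        ambiguous = is_ambiguous(c["effect_allele"], c["other_allele"])
        if remove_ambiguous and ambiguous:
            exclude = True
        labelled.append({**c, "ambiguous": ambiguous, "exclude": exclude})
    return labelled
```

Labelling only adds flags and never drops rows, which is consistent with the documented flow where filtering happens as a separate step after labelling.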

write_scorefiles(directory, score_df, split=False, min_overlap=0.75, **kwargs)

Write matches to a set of files ready for plink2 --score

This method performs the following steps:

  • Labels match candidates

  • Filters match candidates based on labels and user configuration

  • Calculates match rates to see how well the PGS reproduces in the new target genomes

  • Generates a filtered variant log containing the best match candidate

  • Checks if the number of variants in the summary log matches the input scoring file

  • Sets up parallel score calculation (pivots data to wide column format)

  • Writes scores to a directory, splitting based on chromosome and effect type
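The last two steps above (pivoting to wide format, then splitting by chromosome and effect type) can be sketched as follows. The file name pattern is inferred from the doctest output shown earlier, for example goodmatch_ALL_additive_0.scorefile.gz. The meaning of the trailing 0 is not documented here, so the sketch simply hard-codes it. The function names, record layout, and the choice to fill missing weights with 0.0 are assumptions for illustration only.

```python
from collections import defaultdict


def plan_scorefiles(matches, dataset, split=False):
    """Group matched variants by the output file they would be written to.

    matches: list of dicts with keys ID, effect_allele, chr_name, effect_type,
    accession and effect_weight (a hypothetical record layout).
    """
    groups = defaultdict(list)
    for m in matches:
        chrom = m["chr_name"] if split else "ALL"
        # Trailing "_0" copied from the documented output; its meaning is not assumed
        name = f"{dataset}_{chrom}_{m['effect_type']}_0.scorefile.gz"
        groups[name].append(m)
    return dict(groups)


def pivot_wide(rows):
    """Pivot long rows into one row per (ID, effect_allele), one column per accession.

    Missing weights are filled with 0.0 (an assumption) so that an absent variant
    contributes nothing to that score.
    """
    accessions = sorted({r["accession"] for r in rows})
    wide = {}
    for r in rows:
        key = (r["ID"], r["effect_allele"])
        wide.setdefault(key, dict.fromkeys(accessions, 0.0))[r["accession"]] = r["effect_weight"]
    return accessions, wide
```

A wide layout lets a single pass over the genotypes compute several scores at once, which is what the "parallel score calculation" step refers to.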

dataset

property df: polars.LazyFrame

A df containing raw match results

property filter_summary: polars.DataFrame

A log that summarises the impact of filtering

property filtered_matches: polars.LazyFrame

A df containing up to one row per variant (the best possible match)

property match_candidates: polars.LazyFrame

A df containing all possible matches for each input score variant

property summary_log: polars.DataFrame

A summary log containing match rates for variants

match.lib.matchresult.logger