match.lib.matchresult¶
Attributes¶
Classes¶
MatchResult | Represents variants in a scoring file matched against variants in a target genome
MatchResults | Container for MatchResult
Module Contents¶
- class match.lib.matchresult.MatchResult(dataset, matchresult=None, ipc_path=None)¶
Represents variants in a scoring file matched against variants in a target genome
When matching a scoring file, it’s normal for matches to be composed of many MatchResult objects. This is common if the target genome is split to have one chromosome per scoring file, and the container class MatchResults provides some helpful methods for working with split data.

>>> from ._config import Config
>>> from .variantframe import VariantFrame
>>> from .scoringfileframe import ScoringFileFrame, match_variants
>>> target_path = Config.ROOT_DIR.parent / "pgscatalog.core" / "tests" / "data" / "hapnest.bim"
>>> target = VariantFrame(target_path, dataset="hapnest")
>>> score_path = Config.ROOT_DIR.parent / "pgscatalog.core" / "tests" / "data" / "combined.txt.gz"
>>> scorefile = ScoringFileFrame(score_path)
A MatchResult can be instantiated with the LazyFrame output of the match_variants function:

>>> with target as target_df, scorefile as score_df:
...     match_variants(score_df=score_df, target_df=target_df, target=target)
MatchResult(dataset=hapnest, matchresult=[<LazyFrame...], ipc_path=None, df=None)
A MatchResult can also be saved to and loaded from Arrow IPC files:

>>> import tempfile
>>> fout = tempfile.NamedTemporaryFile(delete=False)
>>> with target as target_df, scorefile as score_df:
...     results = match_variants(score_df=score_df, target_df=target_df, target=target)
...     _ = results.collect(outfile=fout.name)
>>> x = MatchResult.from_ipc(fout.name, dataset="hapnest")
>>> x
MatchResult(dataset=hapnest, matchresult=None, ipc_path=..., df=<LazyFrame...>)
- collect(outfile=None)¶
Compute match results and optionally save to file
- classmethod from_ipc(matchresults_ipc_path, dataset)¶
Create an instance from an Arrow IPC file
- dataset¶
- df = None¶
- ipc_path = None¶
- class match.lib.matchresult.MatchResults(*elements)¶
Container for MatchResult
Useful for making matching logs and writing scoring files ready to be used by plink2 --score

>>> import tempfile, os, glob, pathlib
>>> from ._config import Config
>>> from .variantframe import VariantFrame
>>> from .scoringfileframe import ScoringFileFrame, match_variants
>>> fout = tempfile.NamedTemporaryFile(delete=False)
>>> target_path = Config.ROOT_DIR / "tests" / "data" / "good_match.pvar"
>>> score_path = Config.ROOT_DIR / "tests" / "data" / "good_match_scorefile.txt"
>>> target = VariantFrame(target_path, dataset="goodmatch")
>>> scorefile = ScoringFileFrame(score_path)
>>> foutdir, splitfoutdir = tempfile.mkdtemp(), tempfile.mkdtemp()
Using a context manager is really important to prepare ScoringFileFrame and VariantFrame data frames:

>>> with target as target_df, scorefile as score_df:
...     results = match_variants(score_df=score_df, target_df=target_df, target=target)
...     _ = results.collect(outfile=fout.name)
These data frames are transparently backed by Arrow IPC files on disk.
>>> with scorefile as score_df:
...     x = MatchResult.from_ipc(fout.name, dataset="goodmatch")
...     _ = MatchResults(x).write_scorefiles(directory=foutdir, score_df=score_df)
...     _ = MatchResults(x).write_scorefiles(directory=splitfoutdir, split=True, score_df=score_df)
>>> MatchResults(x)
MatchResults([MatchResult(dataset=goodmatch, matchresult=None, ipc_path=...])
By default, scoring files are written with multiple chromosomes per file:
>>> combined_paths = sorted(glob.glob(foutdir + "/*ALL*"), key=lambda x: pathlib.Path(x).stem)
>>> combined_paths
['.../goodmatch_ALL_additive_0.scorefile.gz', '.../goodmatch_ALL_dominant_0.scorefile.gz', '.../goodmatch_ALL_recessive_0.scorefile.gz']
>>> assert len(combined_paths) == 3
Scoring files can also be split by chromosome. The input scoring file contains 20 unique chromosomes, with one dominant and one recessive effect file (but one chromosome didn’t match well):
>>> scorefiles = sorted(os.listdir(splitfoutdir))
>>> scorefiles
['goodmatch_10_additive_0.scorefile.gz', 'goodmatch_11_additive_0.scorefile.gz', ...]
>>> sum("dominant" in f for f in scorefiles)
1
>>> sum("recessive" in f for f in scorefiles)
1
>>> sum("additive" in f for f in scorefiles)
19
>>> assert len(scorefiles) == 21
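The file names in these examples appear to follow a `<dataset>_<chromosome>_<effect type>_<n>.scorefile.gz` layout. That layout is inferred from the doctest output only and is not a documented guarantee. Under that assumption, a minimal standard-library sketch for grouping written files by effect type could look like this (`parse_scorefile_name` is a hypothetical helper, not part of the library):

```python
import re
from collections import Counter

# Assumed naming pattern, inferred from names such as
# "goodmatch_10_additive_0.scorefile.gz"; the library may change it.
SCOREFILE_RE = re.compile(
    r"^(?P<dataset>.+)_(?P<chrom>[^_]+)"
    r"_(?P<effect_type>additive|dominant|recessive)"
    r"_(?P<n>\d+)\.scorefile\.gz$"
)


def parse_scorefile_name(name):
    """Split a score file name into its parts (hypothetical helper)."""
    match = SCOREFILE_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected score file name: {name}")
    return match.groupdict()


# File names taken from the doctest output in this section
names = [
    "goodmatch_10_additive_0.scorefile.gz",
    "goodmatch_11_additive_0.scorefile.gz",
    "goodmatch_ALL_dominant_0.scorefile.gz",
]
parts = [parse_scorefile_name(n) for n in names]

# Count how many files were written for each effect type
effect_counts = Counter(p["effect_type"] for p in parts)
```

Parsing the names rather than using substring checks makes it possible to group by chromosome as well, at the cost of depending on the assumed pattern.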
An important part of matching variants is reporting a log to see how well you’re reproducing a PGS in the new target genomes:
>>> with pl.Config(tbl_formatting="ASCII_MARKDOWN", tbl_hide_column_data_types=True, tbl_width_chars=120), scorefile as score_df:
...     MatchResults(x).full_variant_log(score_df).fetch()  # +ELLIPSIS
shape: (169, 23)
| row_nr | accession | chr_name | chr_position | … | duplicate_ID | match_IDs | match_status | dataset   |
|--------|-----------|----------|--------------|---|--------------|-----------|--------------|-----------|
| 0      | PGS000002 | 11       | 69331418     | … | true         | NA        | excluded     | goodmatch |
| 1      | PGS000002 | 11       | 69379161     | … | false        | NA        | matched      | goodmatch |
| 2      | PGS000002 | 11       | 69331642     | … | false        | NA        | excluded     | goodmatch |
| 2      | PGS000002 | 11       | 69331642     | … | false        | NA        | not_best     | goodmatch |
| 3      | PGS000002 | 5        | 1282319      | … | false        | NA        | matched      | goodmatch |
| …      | …         | …        | …            | … | …            | …         | …            | …         |
| 73     | PGS000001 | 1        | 204518842    | … | false        | NA        | matched      | goodmatch |
| 74     | PGS000001 | 1        | 202187176    | … | false        | NA        | matched      | goodmatch |
| 75     | PGS000001 | 2        | 19320803     | … | false        | NA        | matched      | goodmatch |
| 76     | PGS000001 | 16       | 53855291     | … | false        | NA        | excluded     | goodmatch |
| 76     | PGS000001 | 16       | 53855291     | … | false        | NA        | not_best     | goodmatch |
- filter(score_df, min_overlap=0.75, **kwargs)¶
Filter match candidates after labelling according to user parameters
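One plausible reading of `min_overlap` is the minimum fraction of a score's variants that must match in the target genome for the score to be kept. The sketch below illustrates that reading only; the function names and the counts are made up, and the library's actual filtering logic may differ.

```python
def match_rate(n_matched, n_total):
    """Fraction of a score's variants that matched in the target."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_matched / n_total


def passes_min_overlap(n_matched, n_total, min_overlap=0.75):
    """Assumed rule: keep a score when its match rate reaches min_overlap."""
    return match_rate(n_matched, n_total) >= min_overlap


# Made-up (matched, total) variant counts for two hypothetical scores
counts = {"score_a": (70, 77), "score_b": (40, 92)}
kept = {name: passes_min_overlap(m, t) for name, (m, t) in counts.items()}
```

Here `score_a` would be kept (about 91% of its variants matched) and `score_b` would be dropped (about 43%).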
- full_variant_log(score_df, **kwargs)¶
Generate a log for each variant in a scoring file
Multiple match candidates may exist for each variant in the original file. Describe each variant (one variant per row) with match metadata
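Because a variant can have more than one match candidate, the same `row_nr` can appear on several log rows with different `match_status` values. A small sketch, using only the `row_nr` and `match_status` values visible in the first rows of the log excerpt in the class example, shows one way to tally statuses and spot such variants. It works on plain tuples rather than the library's data frames.

```python
from collections import Counter

# (row_nr, match_status) pairs copied from the first rows of the log excerpt
log_rows = [
    (0, "excluded"),
    (1, "matched"),
    (2, "excluded"),
    (2, "not_best"),
    (3, "matched"),
]

# How often each status occurs across all candidate rows
status_counts = Counter(status for _, status in log_rows)

# Variants (by row_nr) that have more than one candidate row
rows_per_variant = Counter(row_nr for row_nr, _ in log_rows)
multi_candidate = sorted(nr for nr, n in rows_per_variant.items() if n > 1)
```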
- label(keep_first_match=False, remove_ambiguous=True, skip_flip=False, remove_multiallelic=True, filter_IDs=None)¶
Label match candidates according to matching parameters
kwargs control labelling parameters:
keep_first_match: if best match candidates are tied, keep the first? (default: False, drop all candidates for this variant)
remove_ambiguous: remove ambiguous alleles? (default: True)
skip_flip: skip match candidates that are only found after flipping to the opposite strand? (default: False, flipped matches are considered)
remove_multiallelic: remove multiallelic variants before matching (default: True)
filter_IDs: constrain variants to this list of IDs (default: None, don’t constrain)
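To make the `remove_ambiguous` and `skip_flip` options concrete: a single-nucleotide variant is strand-ambiguous when its two alleles are complementary (A/T or C/G), because flipping strands only swaps the alleles. The sketch below shows that idea for single-base alleles. `keep_candidate` is a hypothetical decision rule written for illustration, not the library's labelling code.

```python
# Base pairing used to report an allele on the opposite strand
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(effect_allele, other_allele):
    """True for A/T and C/G SNPs: complementing the alleles just swaps them."""
    return COMPLEMENT.get(effect_allele) == other_allele


def flip(effect_allele, other_allele):
    """Complement both single-base alleles (the opposite-strand view)."""
    return COMPLEMENT[effect_allele], COMPLEMENT[other_allele]


def keep_candidate(effect_allele, other_allele, needed_flip,
                   remove_ambiguous=True, skip_flip=False):
    """Hypothetical rule mirroring two of the labelling parameters."""
    if remove_ambiguous and is_ambiguous(effect_allele, other_allele):
        return False  # cannot tell which strand the alleles are on
    if skip_flip and needed_flip:
        return False  # match only exists on the opposite strand
    return True
```

With the defaults, an A/G variant that matched after flipping is kept, while any A/T or C/G variant is dropped.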
- write_scorefiles(directory, score_df, split=False, min_overlap=0.75, **kwargs)¶
Write matches to a set of files ready for plink2 --score
Does some helpful stuff:
Labels match candidates
Filters match candidates based on labels and user configuration
Calculates match rates to see how well the PGS reproduces in the new target genomes
Generates a filtered variant log containing the best match candidate
Checks if the number of variants in the summary log matches the input scoring file
Sets up parallel score calculation (pivots data to wide column format)
Writes scores to a directory, splitting based on chromosome and effect type
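The last two steps (pivoting to a wide column format, then splitting by chromosome and effect type) can be sketched with plain Python. The variant IDs, weights, column layout, and the choice to fill missing weights with 0.0 are all assumptions made for illustration. The library works on polars data frames and its output format may differ.

```python
from collections import defaultdict

# Made-up long-format matches: one row per (variant, score) pair
long_rows = [
    {"chrom": "1", "effect_type": "additive", "ID": "1:100:A:G",
     "effect_allele": "G", "accession": "PGS000001", "weight": 0.1},
    {"chrom": "1", "effect_type": "additive", "ID": "1:100:A:G",
     "effect_allele": "G", "accession": "PGS000002", "weight": 0.2},
    {"chrom": "1", "effect_type": "additive", "ID": "1:200:C:T",
     "effect_allele": "T", "accession": "PGS000001", "weight": -0.3},
    {"chrom": "2", "effect_type": "dominant", "ID": "2:300:G:A",
     "effect_allele": "A", "accession": "PGS000002", "weight": 0.5},
]


def pivot_wide(rows):
    """Group rows by (chrom, effect_type) and give each score its own column."""
    accessions = sorted({r["accession"] for r in rows})
    groups = defaultdict(dict)
    for r in rows:
        key = (r["chrom"], r["effect_type"])
        variant = (r["ID"], r["effect_allele"])
        # Assumption: a variant absent from a score gets weight 0.0
        wide = groups[key].setdefault(variant, dict.fromkeys(accessions, 0.0))
        wide[r["accession"]] = r["weight"]
    return accessions, groups


def to_tsv(accessions, variants):
    """Render one group as tab-separated lines (illustrative layout)."""
    lines = ["\t".join(["ID", "effect_allele", *accessions])]
    for (variant_id, allele), weights in variants.items():
        cells = [variant_id, allele, *(str(weights[a]) for a in accessions)]
        lines.append("\t".join(cells))
    return lines


accessions, groups = pivot_wide(long_rows)
chr1_lines = to_tsv(accessions, groups[("1", "additive")])
```

Putting every score in its own column means a single pass over the genotypes can calculate all scores in a group at once, which is the point of the wide layout.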
- dataset¶
- property df: polars.LazyFrame¶
A df containing raw match results
- property filter_summary: polars.DataFrame¶
A log that summarises the impact of filtering
- property filtered_matches: polars.LazyFrame¶
A df containing up to one row per variant (the best possible match)
- property match_candidates: polars.LazyFrame¶
A df containing all possible matches for each input score variant
- property summary_log: polars.DataFrame¶
A summary log containing match rates for variants
- match.lib.matchresult.logger¶