calc.lib.legacy.polygenicscore

Attributes

logger

Classes

AdjustArguments

Arguments that control genetic similarity estimation and PGS adjustment

AdjustResults

Results returned by AggregatedPGS.adjust()

AggregatedPGS

A PGS that's been aggregated, melted, and probably contains samples from a reference panel and a target population.

PolygenicScore

Represents the output of plink2 --score written to a file

Module Contents

class calc.lib.legacy.polygenicscore.AdjustArguments

Arguments that control genetic similarity estimation and PGS adjustment

>>> AdjustArguments(method_compare="Mahalanobis", pThreshold=None, method_normalization=("empirical", "mean"))
AdjustArguments(method_compare='Mahalanobis', pThreshold=None, method_normalization=('empirical', 'mean'))
method_compare: str = 'RandomForest'
method_normalization: tuple[str, Ellipsis] = ('empirical', 'mean', 'mean+var')
pThreshold: float | None = None
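
For intuition about what a "Mahalanobis" comparison could involve, the sketch below labels each target sample with the reference population whose centroid is nearest in principal-component space. It is an illustration under stated assumptions, not this module's implementation: `most_similar_population`, the toy PC coordinates, and the population labels are all hypothetical, and the role of `pThreshold` is not modelled.

```python
import numpy as np


def most_similar_population(target_pcs, ref_pcs, ref_labels):
    """Label each target sample with the closest reference population.

    Hypothetical helper for illustration; not part of this module.
    """
    target_pcs = np.asarray(target_pcs, dtype=float)
    ref_pcs = np.asarray(ref_pcs, dtype=float)
    labels = np.asarray(ref_labels)
    populations = sorted(set(ref_labels))

    distances = []
    for pop in populations:
        pcs = ref_pcs[labels == pop]
        centre = pcs.mean(axis=0)
        # Pseudo-inverse keeps the sketch usable if the covariance is singular
        inv_cov = np.linalg.pinv(np.cov(pcs, rowvar=False))
        diff = target_pcs - centre
        # Squared Mahalanobis distance from every target sample to this population
        distances.append(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

    nearest = np.argmin(np.vstack(distances), axis=0)
    return [populations[i] for i in nearest]


# Toy reference panel: two well-separated populations in two PCs
ref_pcs = [[0, 0], [1, 0], [0, 1], [1, 1], [10, 10], [11, 10], [10, 11], [11, 11]]
ref_labels = ["EUR"] * 4 + ["AFR"] * 4
assigned = most_similar_population([[0.4, 0.6], [10.2, 10.9]], ref_pcs, ref_labels)
```

The default `method_compare` is `'RandomForest'`, which this sketch does not cover.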
class calc.lib.legacy.polygenicscore.AdjustResults

Results returned by AggregatedPGS.adjust()

write(directory)

Write model, PGS, and PCA data to a directory

model_meta: dict
models: pandas.DataFrame
pca: pandas.DataFrame
pgs: pandas.DataFrame
scorecols: list[str]
target_label: str
class calc.lib.legacy.polygenicscore.AggregatedPGS(*, target_name, df=None, path=None)

A PGS that’s been aggregated, melted, and probably contains samples from a reference panel and a target population.

The most useful method in this class adjusts PGS based on genetic ancestry similarity estimation.

>>> from ._config import Config
>>> score_path = Config.ROOT_DIR / "tests" / "legacy" / "data" / "aggregated_scores.txt.gz"
>>> AggregatedPGS(path=score_path, target_name="hgdp")
AggregatedPGS(path=PosixPath('.../aggregated_scores.txt.gz'))
adjust(*, ref_pc, target_pc, adjust_arguments=None)

Adjust a PGS based on genetic ancestry similarity estimations.

Returns:

AdjustResults

>>> from ._config import Config
>>> from .principalcomponents import PrincipalComponents
>>> related_path = Config.ROOT_DIR / "tests" / "legacy" / "data" / "ref.king.cutoff.id"
>>> ref_pc = PrincipalComponents(pcs_path=[Config.ROOT_DIR / "tests" / "legacy" /"data" / "ref.pcs"], dataset="reference", psam_path=Config.ROOT_DIR / "tests" / "legacy" /"data" / "ref.psam", pop_type=PopulationType.REFERENCE, related_path=related_path)
>>> target_pcs = PrincipalComponents(pcs_path=Config.ROOT_DIR / "tests" / "legacy" / "data" / "target.pcs", dataset="target", pop_type=PopulationType.TARGET)
>>> score_path = Config.ROOT_DIR / "tests" / "legacy" / "data" / "aggregated_scores.txt"
>>> results = AggregatedPGS(path=score_path, target_name="hgdp").adjust(ref_pc=ref_pc, target_pc=target_pcs)
>>> results.pgs.to_dict().keys()
dict_keys(['SUM|PGS001229_hmPOS_GRCh38', 'percentile_MostSimilarPop|PGS001229_hmPOS_GRCh38', 'Z_MostSimilarPop|PGS001229_hmPOS_GRCh38', ...
>>> results.models
{'dist_empirical': {'PGS001229_hmPOS_GRCh38': {'EUR': {'percentiles': array([-1.04069000e+01, -7.94665080e+00, ...
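
The `Z_MostSimilarPop` and `percentile_MostSimilarPop` column names suggest that each target score is expressed relative to the score distribution of its most similar reference population. The sketch below shows one plausible way to compute such values with the standard library only. It is an assumption about the general technique, not the code behind `adjust()`; `normalise_to_population`, the sample names, and the numbers are hypothetical.

```python
from statistics import mean, stdev


def normalise_to_population(score, reference_scores):
    """Return (Z-score, percentile) of `score` within `reference_scores`."""
    z = (score - mean(reference_scores)) / stdev(reference_scores)
    # Percentile here is the share of reference scores at or below the target score
    percentile = 100 * sum(s <= score for s in reference_scores) / len(reference_scores)
    return z, percentile


# Hypothetical reference score distributions, keyed by population label
reference = {"EUR": [1.0, 2.0, 3.0], "AFR": [10.0, 20.0, 30.0]}
# (sample, most similar population, raw score)
targets = [("sample_a", "EUR", 3.0), ("sample_b", "AFR", 15.0)]

normalised = {
    name: normalise_to_population(score, reference[pop])
    for name, pop, score in targets
}
```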

Write the adjusted results to a directory:

>>> import tempfile, os
>>> dout = tempfile.mkdtemp()
>>> results.write(directory=dout)
>>> sorted(os.listdir(dout))
['target_info.json.gz', 'target_pgs.txt.gz', 'target_popsimilarity.txt.gz']
property df
property path
property target_name
class calc.lib.legacy.polygenicscore.PolygenicScore(*, path=None, df=None, sampleset=None)

Represents the output of plink2 --score written to a file

>>> from ._config import Config
>>> import reprlib
>>> score1 = Config.ROOT_DIR / "tests" / "legacy" / "data" / "cineca_22_additive_0.sscore.zst"
>>> pgs1 = PolygenicScore(sampleset="test", path=score1)
>>> pgs1
PolygenicScore(sampleset='test', path=PosixPath('.../cineca_22_additive_0.sscore.zst'))
>>> pgs2 = PolygenicScore(sampleset="test", path=score1)
>>> reprlib.repr(pgs1.read().to_dict())
"{'DENOM': {('test', 'HG00096', 'HG00096'): 1564, ... 'PGS001229_22_SUM': {('test', 'HG00096', 'HG00096'): 0.54502, ...

It’s often helpful to combine PGS that were split per chromosome or by effect type:

>>> aggregated_score = pgs1 + pgs2
>>> aggregated_score
PolygenicScore(sampleset='test', path='(in-memory)')
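
Conceptually, combining two scores amounts to an element-wise sum over samples that share the same index. A minimal pandas sketch of that idea follows, assuming both inputs use the `(sampleset, FID, IID)` index shown above. The frames and values are illustrative, and this is not necessarily how `PolygenicScore.__add__` is implemented.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("test", "HG00096", "HG00096"), ("test", "HG00097", "HG00097")],
    names=["sampleset", "FID", "IID"],
)
# Two per-chromosome score tables with illustrative values
chrom_a = pd.DataFrame(
    {"DENOM": [1564, 1564], "PGS001229_22_SUM": [0.545020, 0.674401]},
    index=index,
)
chrom_b = chrom_a.copy()

# Align on the index and sum; fill_value=0 keeps samples present in only one table
combined = chrom_a.add(chrom_b, fill_value=0)
```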

Once a score has been fully aggregated, it can be helpful to recalculate the average:

>>> aggregated_score.average()
>>> aggregated_score.df
                                    PGS       SUM  DENOM       AVG
sampleset FID     IID
test      HG00096 HG00096  PGS001229_22  1.090040   3128  0.000348
          HG00097 HG00097  PGS001229_22  1.348802   3128  0.000431
...
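
The values above are consistent with the average being the summed score divided by the denominator (for example, 1.090040 / 3128 ≈ 0.000348). A minimal pandas sketch of that calculation, using the two rows shown and assuming `AVG = SUM / DENOM`:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("test", "HG00096", "HG00096"), ("test", "HG00097", "HG00097")],
    names=["sampleset", "FID", "IID"],
)
df = pd.DataFrame(
    {
        "PGS": ["PGS001229_22", "PGS001229_22"],
        "SUM": [1.090040, 1.348802],
        "DENOM": [3128, 3128],
    },
    index=index,
)

# Recompute the average from the aggregated sum and denominator
df["AVG"] = df["SUM"] / df["DENOM"]
rounded = df["AVG"].round(6).tolist()
```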

Scores can be written to a TSV file:

>>> import tempfile, os
>>> outd = tempfile.mkdtemp()
>>> aggregated_score.write(str(outd))
>>> os.listdir(outd)
['aggregated_scores.txt.gz']

Output files can also be split by sampleset:

>>> splitoutd = tempfile.mkdtemp()
>>> aggregated_score.write(splitoutd, split=True)
>>> sorted(os.listdir(splitoutd), key=lambda x: x.split("_")[0])
['test_pgs.txt.gz']
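
As a rough picture of what splitting by sampleset involves, the sketch below groups a score table on the `sampleset` index level and writes one gzip-compressed TSV per group, following the `<sampleset>_pgs.txt.gz` naming seen above. The `ref` sampleset and the values are hypothetical, and the real `write(..., split=True)` may differ in detail.

```python
import os
import tempfile

import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("ref", "HG00096", "HG00096"), ("test", "HG00097", "HG00097")],
    names=["sampleset", "FID", "IID"],
)
scores = pd.DataFrame(
    {"PGS": ["PGS001229_22", "PGS001229_22"], "SUM": [1.090040, 1.348802]},
    index=index,
)

outdir = tempfile.mkdtemp()
for sampleset, group in scores.groupby(level="sampleset"):
    # pandas infers gzip compression from the .gz suffix
    group.to_csv(os.path.join(outdir, f"{sampleset}_pgs.txt.gz"), sep="\t")

written = sorted(os.listdir(outdir))
```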

If a sampleset can’t be inferred from an argument or the path, an error is raised:

>>> PolygenicScore()
Traceback (most recent call last):
...
TypeError: Missing sampleset

average()

Update the dataframe with a recalculated average.

melt()

Update the dataframe with a melted version (wide format to long format)
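
To illustrate the wide-to-long reshaping, the sketch below turns a wide table with one `<PGS>_SUM` column per score into a long table with `PGS` and `SUM` columns, similar in shape to the aggregated output shown earlier. The column handling is an assumption made for illustration and may not match what `melt()` does internally.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("test", "HG00096", "HG00096"), ("test", "HG00097", "HG00097")],
    names=["sampleset", "FID", "IID"],
)
wide = pd.DataFrame(
    {"DENOM": [3128, 3128], "PGS001229_22_SUM": [1.090040, 1.348802]},
    index=index,
)

# One row per (sample, score): the column name becomes the PGS label
sum_cols = [col for col in wide.columns if col.endswith("_SUM")]
long = wide.melt(
    id_vars=["DENOM"],
    value_vars=sum_cols,
    var_name="PGS",
    value_name="SUM",
    ignore_index=False,  # keep the (sampleset, FID, IID) index
)
long["PGS"] = long["PGS"].str.replace("_SUM$", "", regex=True)
```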

read()

Eagerly load a PGS into a pandas dataframe

The FID column can be missing from the input data:

>>> from ._config import Config
>>> from xopen import xopen
>>> score1 = Config.ROOT_DIR / "tests" / "legacy" / "data" / "cineca_22_additive_0.sscore.zst"
>>> with xopen(score1) as f:
...     f.readline().split()
['#IID', 'ALLELE_CT', 'DENOM', 'NAMED_ALLELE_DOSAGE_SUM', 'PGS001229_22_AVG', 'PGS001229_22_SUM']

In that case, FID is set to IID:

>>> PolygenicScore(sampleset="test", path=score1).read()
                            DENOM  PGS001229_22_SUM
sampleset FID     IID
test      HG00096 HG00096   1564          0.545020
...
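
The FID fallback can be pictured with a small standalone sketch: read a score table whose header starts with `#IID`, copy IID into FID when FID is absent, and build the `(sampleset, FID, IID)` index. The in-memory TSV is illustrative, and `read()` itself may handle this differently.

```python
import io

import pandas as pd

# Illustrative score table with no FID column
raw = "#IID\tDENOM\tPGS001229_22_SUM\nHG00096\t1564\t0.545020\n"

df = pd.read_csv(io.StringIO(raw), sep="\t")
# Drop the leading '#' from the first header field
df.columns = [col.lstrip("#") for col in df.columns]

if "FID" not in df.columns:
    # Fall back to the individual ID when no family ID is provided
    df["FID"] = df["IID"]

df = df.assign(sampleset="test").set_index(["sampleset", "FID", "IID"])
```
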
write(outdir, split=False)

Write PGS to a compressed TSV

property df
property path
calc.lib.legacy.polygenicscore.logger