calc.lib.legacy.principalcomponents

Attributes

logger

Classes

PopulationType

A polygenic score (PGS) can be calculated on a reference panel or a target population.

PrincipalComponents

This class represents principal components analysis (PCA) data calculated by fraposa-pgsc.

Module Contents

class calc.lib.legacy.principalcomponents.PopulationType(*args, **kwds)

A polygenic score (PGS) can be calculated on a reference panel or a target population.

This enum mostly helps to disambiguate instances of PrincipalComponents.

REFERENCE = 'reference'
TARGET = 'target'
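
The snippet below is a minimal sketch of how an enum with these two members behaves. It assumes PopulationType is a standard `enum.Enum` (suggested by the `(*args, **kwds)` signature, but not confirmed here); the class defined in the sketch is a stand-in, not the module's source.

```python
from enum import Enum


class PopulationType(Enum):
    """Stand-in that mirrors the two documented members (assumption)."""

    REFERENCE = "reference"
    TARGET = "target"


# Members can be looked up by value, e.g. when parsing a config string.
parsed = PopulationType("target")

# Enum members are singletons, so identity comparison is safe.
is_target = parsed is PopulationType.TARGET
```

Comparing against the enum member rather than a raw string avoids typos such as "refrence" going unnoticed.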

class calc.lib.legacy.principalcomponents.PrincipalComponents(pcs_path, dataset, pop_type, psam_path=None, related_path=None, **kwargs)

This class represents principal components analysis (PCA) data calculated by fraposa-pgsc.

PCA data may come from a reference population or a target population. Target populations have been projected onto the reference population.

>>> from ._config import Config
>>> related_path = Config.ROOT_DIR / "tests" / "legacy" / "data" / "ref.king.cutoff.id"
>>> psam_path = Config.ROOT_DIR / "tests" / "legacy" / "data" / "ref.psam"
>>> ref_pc = PrincipalComponents(pcs_path=[Config.ROOT_DIR / "tests" / "legacy" / "data" / "ref.pcs"], dataset="reference", psam_path=psam_path, related_path=related_path, pop_type=PopulationType.REFERENCE)
>>> ref_pc
PrincipalComponents(dataset='reference', pop_type=PopulationType.REFERENCE, pcs_path=[PosixPath('.../ref.pcs')], psam_path=PosixPath('.../ref.psam'))
>>> ref_pc.df.to_dict()
{'PC1': {('reference', 'HG00096', 'HG00096'): -23.8212, ('reference', 'HG00097', 'HG00097'): -24.8106, ...
>>> target_pcs = PrincipalComponents(pcs_path=Config.ROOT_DIR / "tests" / "legacy" / "data" / "target.pcs", dataset="target", pop_type=PopulationType.TARGET)
>>> target_pcs
PrincipalComponents(dataset='target', pop_type=PopulationType.TARGET, pcs_path=[PosixPath('.../target.pcs')], psam_path=None)
>>> target_pcs.df.to_dict()
{'PC1': {('target', 'HGDP00001', 'HGDP00001'): -18.5135, ('target', 'HGDP00003', 'HGDP00003'): -18.8314, ...
dataset

property df

A pandas dataframe that contains PCA data.

Reference data also contains population label columns loaded from sample information files.

Raises:

ValueError – If the reference population consists of fewer than 100 samples
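
The size check described above could look roughly like the following. This is a sketch only: the function name, the constant, and the error message are hypothetical, and the real check operates on the loaded dataframe rather than a plain list.

```python
MIN_REFERENCE_SAMPLES = 100  # threshold taken from the documented ValueError


def check_reference_size(sample_ids, min_samples=MIN_REFERENCE_SAMPLES):
    """Raise ValueError if there are too few reference samples (hypothetical helper)."""
    n_samples = len(sample_ids)
    if n_samples < min_samples:
        raise ValueError(
            f"Reference population has {n_samples} samples; "
            f"at least {min_samples} are required"
        )
    return n_samples
```

Callers that build a reference PrincipalComponents object from a small test panel should expect this error when df is first accessed.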

property max_pcs

The maximum number of PCs used in calculations

property npcs_norm

Number of PCs used for population normalization (default = 4)

property npcs_popcomp

Number of PCs used for population comparison (default = 5)
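
One plausible relationship between these three properties is sketched below. It assumes max_pcs is simply the larger of the two PC counts and that PC columns are named PC1, PC2, and so on (only PC1 appears in the doctest output above). The `PcSettings` class is hypothetical and not part of this module.

```python
from dataclasses import dataclass


@dataclass
class PcSettings:
    """Hypothetical container for the documented defaults."""

    npcs_popcomp: int = 5  # documented default for population comparison
    npcs_norm: int = 4     # documented default for population normalization

    @property
    def max_pcs(self) -> int:
        # Assumption: the largest PC count needed by any downstream step.
        return max(self.npcs_popcomp, self.npcs_norm)

    def pc_columns(self) -> list[str]:
        # Assumption: columns follow the 'PC1', 'PC2', ... naming pattern.
        return [f"PC{i}" for i in range(1, self.max_pcs + 1)]


settings = PcSettings()
columns = settings.pc_columns()
```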

property pop_type

See PopulationType

property poplabel

The group label used to assign target samples to the reference population group they most resemble, e.g. SAS/EUR/AFR

property psam_path

Path to a plink2 sample information file for the reference population

property related_path

Path to a plink2 kinship cutoff file

Related reference samples are removed from analysis
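
To illustrate the idea of dropping related samples, here is a rough sketch. It assumes the cutoff file lists one sample ID per line to exclude, with an optional header line starting with `#`; the exact file layout and both helper names are assumptions, not the module's implementation.

```python
def parse_related_ids(text):
    """Collect sample IDs from cutoff-file text (hypothetical helper)."""
    ids = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        # Assumption: the sample ID is the last whitespace-separated field.
        ids.add(line.split()[-1])
    return ids


def drop_related(samples, related_ids):
    """Return samples that are not flagged as related, preserving order."""
    return [s for s in samples if s not in related_ids]


cutoff_text = "#IID\nHG00097\n"
kept = drop_related(["HG00096", "HG00097"], parse_related_ids(cutoff_text))
```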

calc.lib.legacy.principalcomponents.logger