core.lib.models¶

PGS Catalog pydantic models for data validation

Best ways to reuse:

from pgscatalog.core import models, then use models.CatalogScoreVariant(**d)

import pgscatalog.core, then use the fully qualified name (pgscatalog.core.models.CatalogScoreVariant)

Classes¶

`Allele`	A class that represents an allele found in PGS Catalog scoring files
`CatalogScoreHeader`	A ScoreHeader that validates the PGS Catalog Scoring File header standard
`CatalogScoreVariant`	A model representing a row from a PGS Catalog scoring file
`ScoreFormatVersion`	See https://www.pgscatalog.org/downloads/#scoring_changes
`ScoreHeader`	Headers store useful metadata about a scoring file.
`ScoreLog`	A log that includes header information and variant summary statistics
`ScoreLogs`	A container of ScoreLog to simplify serialising to a JSON list
`ScoreVariant`	This model includes attributes useful for processing and normalising variants
`VariantLog`	This model consists of variant-level statistics we need to summarise in the ScoreLog
`VariantType`	Complex alleles are usually haplotypes/diplotypes and the gametic phase must be known to apply them accurately.

Module Contents¶

class core.lib.models.Allele¶

A class that represents an allele found in PGS Catalog scoring files

>>> simple_ea = Allele(**{"allele": "A"})
>>> simple_ea
Allele(allele='A', is_snp=True)
>>> str(simple_ea)
'A'
>>> Allele(**{"allele": "AG"})
Allele(allele='AG', is_snp=True)
>>> hla_example = Allele(**{"allele": "+"})
>>> hla_example
Allele(allele='+', is_snp=False)

>>> Allele(allele="A")
Allele(allele='A', is_snp=True)

>>> Allele(allele="A/T").has_multiple_alleles
True

serialize() → str¶: When dumping the model, flatten it to just return the allele as a string

allele: str¶

property has_multiple_alleles: bool¶

property is_snp: bool¶: SNPs are the most common type of effect allele in PGS Catalog scoring files. More complex effect alleles, like HLAs or APOE genes, often require extra work to represent in genomes. Users should be warned about complex effect alleles.
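The doctests above are consistent with a simple character check. The following is a minimal stdlib-only sketch, not the library's implementation. It assumes that an allele counts as a SNP when it contains only the characters A, C, G and T, and that a `/` separates multiple alleles. The free functions `is_snp` and `has_multiple_alleles` are hypothetical stand-ins for the properties.

```python
# Assumed rule: only plain nucleotide characters make a SNP allele
NUCLEOTIDES = frozenset("ACGT")


def is_snp(allele: str) -> bool:
    """True when the allele is non-empty and every character is A, C, G or T."""
    return bool(allele) and set(allele) <= NUCLEOTIDES


def has_multiple_alleles(allele: str) -> bool:
    """True when the string lists several alleles separated by '/' (assumed rule)."""
    return "/" in allele


for example in ("A", "AG", "+", "APOE_e2", "A/T"):
    print(example, is_snp(example), has_multiple_alleles(example))
```

Under these assumptions, complex alleles such as `+` or `APOE_e2` are flagged as non-SNPs, which is the case users should be warned about.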

class core.lib.models.CatalogScoreHeader¶

A ScoreHeader that validates the PGS Catalog Scoring File header standard

https://www.pgscatalog.org/downloads/#dl_ftp_scoring

>>> from ._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000001_hmPOS_GRCh38.txt.gz"
>>> test = CatalogScoreHeader.from_path(testpath)
>>> test
CatalogScoreHeader(pgs_id='PGS000001', pgs_name='PRS77_BC', trait_reported='Breast cancer', genome_build=None, format_version=<ScoreFormatVersion.v2: '2.0'>, trait_mapped=['breast carcinoma'], trait_efo=['EFO_0000305'], variants_number=77, weight_type=None, pgp_id='PGP000001', citation='Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036', HmPOS_build=GenomeBuild.GRCh38, HmPOS_date=datetime.date(2022, 7, 29), HmPOS_match_pos='{"True": null, "False": null}', HmPOS_match_chr='{"True": null, "False": null}')
>>> test.variants_number == test.row_count
True

classmethod check_format_version(version: ScoreFormatVersion) → ScoreFormatVersion¶

classmethod check_pgp_id(pgp_id: str) → str¶

classmethod check_pgs_id(pgs_id: str) → str¶

classmethod parse_genome_build(value: str) → pgscatalog.core.lib.genomebuild.GenomeBuild | None¶

classmethod parse_weight_type(value: str | None) → str | None¶

serialize_genomebuild(genome_build: pgscatalog.core.lib.genomebuild.GenomeBuild | None, _info: pydantic.SerializationInfo) → str¶

classmethod split_traits(trait: str) → list[str]¶

HmPOS_build: Annotated[pgscatalog.core.lib.genomebuild.GenomeBuild | None, Field(default=None)]¶

HmPOS_date: Annotated[datetime.date | None, Field(default=None)]¶

HmPOS_match_chr: Annotated[str | None, Field(default=None)]¶

HmPOS_match_pos: Annotated[str | None, Field(default=None)]¶

citation: str¶

format_version: ScoreFormatVersion¶

property is_harmonised: bool¶

license: Annotated[str | None, Field('PGS obtained from the Catalog should be cited appropriately, and used in accordance with any licensing restrictions set by the authors. See EBI Terms of Use (https://www.ebi.ac.uk/about/terms-of-use/) for additional details.', repr=False)]¶

pgp_id: str¶

trait_efo: Annotated[list[str], Field(description="Ontology trait name, e.g. 'breast carcinoma")]¶

trait_mapped: Annotated[list[str], Field(description='Trait name')]¶

variants_number: Annotated[int, Field(gt=0, description='Number of variants listed in the PGS', default=None)]¶

weight_type: Annotated[str | None, Field(description='Variant weight type', default=None)]¶

class core.lib.models.CatalogScoreVariant¶

A model representing a row from a PGS Catalog scoring file, defined here:

https://www.pgscatalog.org/downloads/#scoring_columns

Implementation notes:

You should instantiate effect weight fields with strings (e.g. with csv.reader, which returns data as a list of strings)

The model always handles effect weights internally as strings and will coerce numeric input to strings when instantiated

Our string obsession comes from a desire to faithfully reproduce author submitted data and avoid introducing precision errors
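The motivation for keeping weights as strings can be seen without the models at all. The sketch below uses only the standard library and a made-up two-row scoring file (the rows are illustrative, not taken from the Catalog). Reading weights as text keeps the author's formatting, while a round trip through float rewrites it.

```python
import csv
import io

# Hypothetical scoring file content, tab separated
raw = "effect_allele\teffect_weight\nA\t0.1020\nT\t-2.76009e-02\n"

rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))

# csv returns every value as a string, exactly as written by the author
as_text = [row["effect_weight"] for row in rows]

# Converting to float and back changes the textual representation
as_float = [repr(float(weight)) for weight in as_text]

for text, number in zip(as_text, as_float):
    print(f"{text!r} -> {number!r}")
```

The trailing zero and the scientific notation are both lost once the value has been a float, so the original text can no longer be reproduced.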

Extra / dynamically named fields:

Only one type of dynamic field is supported. Ancestry specific allele frequency information uses labels defined by authors.

An example from the first row from PGS000662:

>>> variant_with_allelefrequency = {"chr_name": "1", "chr_position": 5743196, "effect_allele": "T", "other_allele": "C", "effect_weight": 0.102298257, "allelefrequency_effect_European": 0.067, "allelefrequency_effect_African": 0.439, "allelefrequency_effect_Asian": 0.113, "allelefrequency_effect_Hispanic": 0.157}
>>> CatalogScoreVariant(**variant_with_allelefrequency)
CatalogScoreVariant(rsID=None, chr_name='1', chr_position=5743196..., allelefrequency_effect_European=0.067, allelefrequency_effect_African=0.439, allelefrequency_effect_Asian=0.113, allelefrequency_effect_Hispanic=0.157, ...)

An example from the first row from PGS000018 with the edited column name ‘rsid’:

>>> variant_with_rsid_column = {"rsid": "rs2843152", "chr_name": 1, "chr_position": 2245570, "effect_allele": "G", "other_allele": "C", "effect_weight": -2.76009e-02}
>>> CatalogScoreVariant(**variant_with_rsid_column)
CatalogScoreVariant(rsID='rs2843152', chr_name='1', chr_position=2245570..., effect_weight='-0.0276009', ...)

Extra field names which don’t follow the pattern “allelefrequency_effect_{label}” will raise a ValueError:

>>> bad_extra_fields = variant_with_allelefrequency | {"favourite_ice_cream": "vanilla"}
>>> CatalogScoreVariant(**bad_extra_fields)
Traceback (most recent call last):
...
pydantic_core._pydantic_core.ValidationError: 1 validation error for CatalogScoreVariant
  Value error, Invalid extra fields detected: ['favourite_ice_cream'] ...
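The naming rule for extra fields can be sketched with a regular expression. This is an illustration under stated assumptions, not the validator used by the model: the pattern, the helper `invalid_extra_fields` and the `known` set are all hypothetical, and the real model knows its own field names.

```python
import re

# Assumed pattern: the fixed prefix followed by a non-empty, author-defined label
ANCESTRY_FIELD = re.compile(r"allelefrequency_effect_.+")


def invalid_extra_fields(row: dict, known_fields: set) -> list:
    """Return names that are neither known fields nor ancestry frequency fields."""
    return [
        name
        for name in row
        if name not in known_fields and ANCESTRY_FIELD.fullmatch(name) is None
    ]


# Hypothetical subset of the model's declared fields
known = {"chr_name", "chr_position", "effect_allele", "other_allele", "effect_weight"}

row = {
    "chr_name": "1",
    "chr_position": 5743196,
    "effect_allele": "T",
    "other_allele": "C",
    "effect_weight": "0.102298257",
    "allelefrequency_effect_European": 0.067,
    "favourite_ice_cream": "vanilla",
}

bad = invalid_extra_fields(row, known)
print(bad)
```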

Complex alleles are represented a little differently:

>>> complex_allele = {"chr_name": 19, "effect_allele": "APOE_e2", "effect_weight": -0.5, "locus_name": "APOE", "is_haplotype": True, "variant_type": "APOE_allele", "variant_description": None}
>>> CatalogScoreVariant(**complex_allele)
CatalogScoreVariant(rsID=None, chr_name='19', chr_position=None, effect_allele=Allele(allele='APOE_e2', is_snp=False), other_allele=None, locus_name='APOE', is_haplotype=True, is_diplotype=False, imputation_method=None, variant_description=None, inclusion_criteria=None, effect_weight='-0.5', is_interaction=False, is_dominant=False, is_recessive=False, dosage_0_weight=None, dosage_1_weight=None, dosage_2_weight=None, OR=None, HR=None, allelefrequency_effect=None, hm_source=None, hm_rsID=None, hm_chr=None, hm_pos=None, hm_inferOtherAllele=None, hm_match_chr=None, hm_match_pos=None, variant_type=<VariantType.APOE_ALLELE: 'APOE_allele'>, variant_id='19::APOE_e2:', is_harmonised=False, is_complex=True, is_non_additive=False, effect_type=EffectType.ADDITIVE)

Although effect weights are typed as optional, if all effect weight fields are missing then a model validator will raise a validation error:

>>> CatalogScoreVariant(**{"chr_name": "19", "chr_position": 1, "effect_allele": "A", "effect_weight": None})
Traceback (most recent call last):
...
pydantic_core._pydantic_core.ValidationError: 1 validation error for CatalogScoreVariant
  Value error, All effect weight fields are missing ...

effect_weight can be missing if all dosage_n_weight (non-additive) fields are present.

However, if any dosage_n_weight field is present, then all of them must be present:

>>> CatalogScoreVariant(**{"chr_name": "19", "chr_position": 1, "effect_allele": "A", "dosage_0_weight": 0.1, "dosage_1_weight": None, "dosage_2_weight": None})
Traceback (most recent call last):
...
pydantic_core._pydantic_core.ValidationError: 1 validation error for CatalogScoreVariant
  Value error, Dosage missing effect weight ...

A variant may have all effect weight fields. During normalisation the standard effect_weight column will be used:

>>> CatalogScoreVariant(**{"chr_name": "19", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.05, "dosage_0_weight": 0, "dosage_1_weight": 0.1, "dosage_2_weight": 0.3})
CatalogScoreVariant(rsID=None, chr_name='19', ..., is_non_additive=False, ...

Note that is_non_additive is False whenever the effect_weight column exists, even if the non-additive (dosage_n_weight) fields are also present.
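The effect weight rules described above can be summarised in a few lines. This is a sketch that mirrors the documented behaviour and error messages, not the model validator itself. The function `check_effect_weights` is a hypothetical helper, and treating a partial set of dosage weights as an error even when `effect_weight` is present is an assumption.

```python
def check_effect_weights(effect_weight, dosage_weights):
    """Validate weight fields and return True if the variant is non-additive.

    effect_weight: a string or None
    dosage_weights: a tuple of three strings or None (dosage 0, 1 and 2)
    """
    present = [weight is not None for weight in dosage_weights]

    if effect_weight is None and not any(present):
        raise ValueError("All effect weight fields are missing")

    # Dosage weights are all-or-nothing
    if any(present) and not all(present):
        raise ValueError("Dosage missing effect weight")

    # The standard effect_weight column takes priority when it exists
    return effect_weight is None and all(present)


print(check_effect_weights("0.05", ("0", "0.1", "0.3")))
print(check_effect_weights(None, ("0", "1", "0")))
```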

classmethod alleles_must_parse(value: Any) → Allele¶

check_complex_variants() → CatalogScoreVariant¶

check_effect_weights() → CatalogScoreVariant¶

check_extra_fields() → CatalogScoreVariant¶: Only allelefrequency_effect_{ancestry} is supported as an extra field. {ancestry} is dynamic and set by submitters.

check_position() → CatalogScoreVariant¶

classmethod effect_weight_must_float(weight: str | None) → str | None¶

classmethod empty_string_to_none(v: Any) → Any | None¶

classmethod set_missing_rsid(rsid: str | None) → str | None¶

HR: Annotated[float | None, Field(default=None, title='Hazard Ratio', description='Author-reported effect sizes can be supplied to the Catalog. If no other effect_weight is given the weight is calculated using the log(OR) or log(HR).')]¶

OR: Annotated[float | None, Field(default=None, title='Odds Ratio', description='Author-reported effect sizes can be supplied to the Catalog. If no other effect_weight is given the weight is calculated using the log(OR) or log(HR).')]¶

allelefrequency_effect: Annotated[float | None, Field(default=None, title='Effect Allele Frequency', description='Reported effect allele frequency, if the associated locus is a haplotype then haplotype frequency will be extracted.', ge=0)]¶

chr_name: Annotated[str | None, Field(default=None, title='Location - Chromosome ', description='Chromosome name/number associated with the variant.', coerce_numbers_to_str=True)]¶

chr_position: Annotated[int | None, Field(default=None, title='Location within the Chromosome', description='Chromosomal position associated with the variant.', gt=0)]¶

complex_columns: ClassVar[tuple[str, str, str]] = ('is_haplotype', 'is_diplotype', 'is_interaction')¶

dosage_0_weight: Annotated[str | None, Field(default=None, title='Effect weight with 0 copy of the effect allele', description='Weights that are specific to different dosages of the effect_allele (e.g. {0, 1, 2} copies) can also be reported when the the contribution of the variants to the score is not encoded as additive, dominant, or recessive. In this case three columns are added corresponding to which variant weight should be applied for each dosage, where the column name is formated as dosage_#_weight where the # sign indicates the number of effect_allele copies.', coerce_numbers_to_str=True)]¶

dosage_1_weight: Annotated[str | None, Field(default=None, title='Effect weight with 1 copy of the effect allele', description='Weights that are specific to different dosages of the effect_allele (e.g. {0, 1, 2} copies) can also be reported when the the contribution of the variants to the score is not encoded as additive, dominant, or recessive. In this case three columns are added corresponding to which variant weight should be applied for each dosage, where the column name is formated as dosage_#_weight where the # sign indicates the number of effect_allele copies.', coerce_numbers_to_str=True)]¶

dosage_2_weight: Annotated[str | None, Field(default=None, title='Effect weight with 2 copies of the effect allele', description='Weights that are specific to different dosages of the effect_allele (e.g. {0, 1, 2} copies) can also be reported when the the contribution of the variants to the score is not encoded as additive, dominant, or recessive. In this case three columns are added corresponding to which variant weight should be applied for each dosage, where the column name is formated as dosage_#_weight where the # sign indicates the number of effect_allele copies.', coerce_numbers_to_str=True)]¶

effect_allele: Annotated[Allele | None, Field(default=None, title='Effect Allele', description="The allele that's dosage is counted (e.g. {0, 1, 2}) and multiplied by the variant's weight (effect_weight) when calculating score. The effect allele is also known as the 'risk allele'. Note: this does not necessarily need to correspond to the minor allele/alternative allele.")]¶

property effect_type: pgscatalog.core.lib.effecttype.EffectType¶

effect_weight: Annotated[str | None, Field(default=None, title='Variant Weight', description='Value of the effect that is multiplied by the dosage of the effect allele (effect_allele) when calculating the score. Additional information on how the effect_weight was derived is in the weight_type field of the header, and score development method in the metadata downloads.', coerce_numbers_to_str=True)]¶

harmonised_columns: ClassVar[tuple[str, str, str, str]] = ('hm_source', 'hm_rsID', 'hm_chr', 'hm_pos')¶

hm_chr: Annotated[str | None, Field(default=None, title='Harmonized chromosome name', description='Chromosome that the harmonized variant is present on, preferring matches to chromosomes over patches present in later builds.')]¶

hm_inferOtherAllele: Annotated[Allele | None, Field(default=None, title='Harmonized other alleles', description='If only the effect_allele is given we attempt to infer the non-effect/other allele(s) using Ensembl/dbSNP alleles.')]¶

hm_match_chr: Annotated[bool | None, Field(default=None, title='FLAG: matching chromosome name', description='Used for QC. Only provided if the scoring file is being harmonized to the same genome build, and where the chromosome name is provided in the column chr_name.')]¶

hm_match_pos: Annotated[bool | None, Field(default=None, title='FLAG: matching chromosome position', description='Used for QC. Only provided if the scoring file is being harmonized to the same genome build, and where the chromosome name is provided in the column chr_position.')]¶

hm_pos: Annotated[int | None, Field(ge=0, default=None, title='Harmonized chromosome position', description='Chromosomal position (base pair location) where the variant is located, preferring matches to chromosomes over patches present in later builds.')]¶

hm_rsID: Annotated[str | None, Field(default=None, title='Harmonized rsID', description='Current rsID. Differences between this column and the author-reported column (rsID) indicate variant merges and annotation updates from dbSNP.')]¶

hm_source: Annotated[str | None, Field(default=None, title='Provider of the harmonized variant information', description='Data source of the variant position. Options include: ENSEMBL, liftover, author-reported (if being harmonized to the same build).')]¶

imputation_method: Annotated[str | None, Field(default=None, title='Imputation Method', description='This described whether the variant was specifically called with a specific imputation or variant calling method. This is mostly kept to describe HLA-genotyping methods (e.g. flag SNP2HLA, HLA*IMP) that gives alleles that are not referenced by genomic position.')]¶

inclusion_criteria: Annotated[str | None, Field(default=None, title='Score Inclusion Criteria', description='Explanation of when this variant gets included into the PGS (e.g. if it depends on the results from other variants).')]¶

property is_complex: bool¶

is_diplotype: Annotated[bool | None, Field(default=False, title='FLAG: Diplotype', description='This is a TRUE/FALSE variable that flags whether the effect allele is a haplotype/diplotype rather than a single SNP. Constituent SNPs in the haplotype are semi-colon separated.')]¶

is_dominant: Annotated[bool | None, Field(default=False, title='FLAG: Dominant Inheritance Model', description='This is a TRUE/FALSE variable that flags whether the weight should be added to the PGS sum if there is at least 1 copy of the effect allele (e.g. it is a dominant allele).')]¶

is_haplotype: Annotated[bool | None, Field(default=False, title='FLAG: Haplotype', description='This is a TRUE/FALSE variable that flags whether the effect allele is a haplotype/diplotype rather than a single SNP. Constituent SNPs in the haplotype are semi-colon separated.')]¶

property is_harmonised: bool¶

property is_hm_bad: bool¶: Was harmonisation OK?

is_interaction: Annotated[bool | None, Field(default=False, title='FLAG: Interaction', description='This is a TRUE/FALSE variable that flags whether the weight should be multiplied with the dosage of more than one variant. Interactions are demarcated with a _x_ between entries for each of the variants present in the interaction.')]¶

property is_non_additive: bool¶

is_recessive: Annotated[bool | None, Field(default=False, title='FLAG: Recessive Inheritance Model', description='This is a TRUE/FALSE variable that flags whether the weight should be added to the PGS sum only if there are 2 copies of the effect allele (e.g. it is a recessive allele).')]¶

locus_name: Annotated[str | None, Field(default=None, title='Locus Name', description='This is kept in for loci where the variant may be referenced by the gene (APOE e4). It is also common (usually in smaller PGS) to see the variants named according to the genes they impact.')]¶

model_config¶

non_additive_columns: ClassVar[tuple[str, str, str]] = ('dosage_0_weight', 'dosage_1_weight', 'dosage_2_weight')¶

other_allele: Annotated[Allele | None, Field(default=None, title='Other allele(s)', description='The other allele(s) at the loci. Note: this does not necessarily need to correspond to the reference allele.')]¶

rsID: Annotated[str | None, Field(default=None, validation_alias=AliasChoices('rsID', 'rsid'), title='dbSNP Accession ID (rsID)', description='The SNP’s rsID. This column also contains HLA alleles in the standard notation (e.g. HLA-DQA1*0102) that aren’t always provided with chromosomal positions.')]¶

variant_description: Annotated[str | None, Field(default=None, title='Variant Description', description='This field describes any extra information about the variant (e.g. how it is genotyped or scored) that cannot be captured by the other fields.')]¶

property variant_id: str¶: ID = chr:pos:effect_allele:other_allele

variant_type: Annotated[VariantType | None, Field(default=None, title='Complex alleles only: how is the variant name formatted?')]¶
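The `variant_id` format (chr:pos:effect_allele:other_allele) can be sketched as a simple join. The helper `build_variant_id` is hypothetical. Rendering missing parts as empty strings is an assumption inferred from the complex allele doctest above, where the ID is shown as '19::APOE_e2:'.

```python
def build_variant_id(chr_name, chr_position, effect_allele, other_allele):
    """Join the four components with ':', leaving missing parts empty."""
    parts = (chr_name, chr_position, effect_allele, other_allele)
    return ":".join("" if part is None else str(part) for part in parts)


print(build_variant_id("19", None, "APOE_e2", None))
print(build_variant_id("1", 5743196, "T", "C"))
```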

class core.lib.models.ScoreFormatVersion¶

See https://www.pgscatalog.org/downloads/#scoring_changes (v1 was deprecated in December 2021)

v2 = '2.0'¶

class core.lib.models.ScoreHeader¶

Headers store useful metadata about a scoring file.

Data validation is less strict than the CatalogScoreHeader, to make it easier for people to use custom scoring files with the PGS Catalog Calculator.

>>> ScoreHeader(**{"pgs_id": "PGS123456", "trait_reported": "testtrait", "genome_build": "GRCh38"})
ScoreHeader(pgs_id='PGS123456', pgs_name=None, trait_reported='testtrait', genome_build=GenomeBuild.GRCh38)

>>> ScoreHeader(**{"omicspred_id": "OPGS123456", "trait_reported": "testtrait", "genome_build": "GRCh38"})
ScoreHeader(pgs_id='OPGS123456', pgs_name=None, trait_reported='testtrait', genome_build=GenomeBuild.GRCh38)

>>> ScoreHeader(**{"score_id": "SC1234B", "trait_reported": "testtrait", "genome_build": "GRCh37"})
ScoreHeader(pgs_id='SC1234B', pgs_name=None, trait_reported='testtrait', genome_build=GenomeBuild.GRCh37)

>>> from ._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000001_hmPOS_GRCh38.txt.gz"
>>> ScoreHeader.from_path(testpath).row_count
77

>>> from ._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "OPGS002493.txt.gz"
>>> test = ScoreHeader.from_path(testpath) # doctest

classmethod from_path(path: str | pathlib.Path) → Self¶

classmethod parse_genome_build(value: str) → pgscatalog.core.lib.genomebuild.GenomeBuild | None¶

serialize_genomebuild(genome_build: pgscatalog.core.lib.genomebuild.GenomeBuild, _info: pydantic.SerializationInfo) → str¶

genome_build: Annotated[pgscatalog.core.lib.genomebuild.GenomeBuild | None, Field(description='Genome build')]¶

property is_harmonised: bool¶

pgs_id: Annotated[str | None, Field(title='PGS identifier', validation_alias=AliasChoices('pgs_id', 'omicspred_id', 'score_id'))]¶

pgs_name: Annotated[str | None, Field(description='PGS name', default=None)]¶

property row_count: int¶: Calculate the number of variants in the scoring file by counting the number of rows

trait_reported: Annotated[str, Field(description='Trait name')]¶
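To illustrate what reading a header and counting rows involves, here is a stdlib-only sketch. It is not how `from_path` or `row_count` are implemented. It assumes that header lines start with `#` and hold `key=value` pairs, and that the first non-comment line is the column header. The sample text and the helper `read_header_and_count` are hypothetical.

```python
import io

# Hypothetical, heavily shortened scoring file
sample = (
    "###PGS CATALOG SCORING FILE\n"
    "#pgs_id=PGS000001\n"
    "#trait_reported=Breast cancer\n"
    "#genome_build=GRCh38\n"
    "rsID\tchr_name\teffect_allele\teffect_weight\n"
    "rs1\t1\tA\t0.1\n"
    "rs2\t2\tT\t0.2\n"
)


def read_header_and_count(handle):
    """Collect key=value header pairs and count variant rows."""
    header = {}
    data_lines = 0
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            key, sep, value = line.lstrip("#").partition("=")
            if sep:  # skip banner comments without '='
                header[key.strip()] = value.strip()
        elif line:
            data_lines += 1
    # The first non-comment line is assumed to be the column header
    return header, max(data_lines - 1, 0)


header, row_count = read_header_and_count(io.StringIO(sample))
print(header, row_count)
```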

class core.lib.models.ScoreLog¶

A log that includes header information and variant summary statistics

>>> header = CatalogScoreHeader(pgs_id='PGS000001', pgs_name='PRS77_BC', trait_reported='Breast cancer', genome_build=None, format_version=ScoreFormatVersion.v2, trait_mapped='breast carcinoma', trait_efo='EFO_0000305', variants_number=77, weight_type="NR", pgp_id='PGP000001', citation='Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036', HmPOS_build="GRCh38", HmPOS_date="2022-07-29")
>>> harmonised_variant = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "HLA-DQ", "effect_weight": 0.5, "hm_chr": "1", "hm_pos": 1, "hm_rsID": "rs1921", "hm_source": "ENSEMBL",  "row_nr": 0, "accession": "test"})
>>> variant_log = harmonised_variant.model_dump(include={"hm_source", "is_complex"})
>>> scorelog = ScoreLog(header=header, compatible_effect_type=True, variant_logs=[VariantLog(**variant_log)])
>>> scorelog
ScoreLog(header=CatalogScoreHeader(...), compatible_effect_type=True, has_complex_alleles=True, pgs_id='PGS000001', is_harmonised=True, sources=['ENSEMBL'])

In the original scoring file header there were 77 variants:

>>> scorelog.header.variants_number
77

But we’ve only got 1 ScoreVariant:

>>> scorelog.n_actual_variants
1
>>> scorelog.variant_count_difference
76
>>> scorelog.variants_are_missing
True

Maybe they were all filtered out during normalisation. It’s important to log and warn when this happens.

>>> scorelog.sources
['ENSEMBL']

>>> scorelog.model_dump()
{'header': {'pgs_id': 'PGS000001', ...}, 'compatible_effect_type': True, 'has_complex_alleles': True, 'pgs_id': 'PGS000001', 'is_harmonised': True, 'sources': ['ENSEMBL']}
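The variant count comparison in the doctest above amounts to simple arithmetic. This sketch assumes that the difference is the declared header count minus the number of logged variants, and that a missing count yields None. The function and variable names are hypothetical.

```python
from typing import Optional


def variant_count_difference(declared: Optional[int], actual: Optional[int]) -> Optional[int]:
    """Declared variants in the header minus variants actually logged."""
    if declared is None or actual is None:
        return None
    return declared - actual


# Values from the doctest above: 77 declared, 1 logged
difference = variant_count_difference(77, 1)
variants_are_missing = difference is not None and difference != 0

print(difference, variants_are_missing)
```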

compatible_effect_type: bool¶

property has_complex_alleles: bool¶: Do any variants contain complex alleles? e.g. HLA/APOE

header: ScoreHeader | CatalogScoreHeader¶

property is_harmonised: bool¶

model_config¶

property n_actual_variants: int | None¶

property pgs_id: str | None¶

property sources: list[str] | None¶

property variant_count_difference: int | None¶

variant_logs: list[VariantLog] | None¶

property variants_are_missing: bool¶

class core.lib.models.ScoreLogs¶

A container of ScoreLog to simplify serialising to a JSON list

root: list[ScoreLog]¶

class core.lib.models.ScoreVariant¶

This model includes attributes useful for processing and normalising variants

>>> variant = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "row_nr": 0, "accession": "test"})
>>> variant
ScoreVariant(rsID=None, chr_name='1', chr_position=1, effect_allele=Allele(allele='A', ...
>>> variant.is_complex
False
>>> variant.is_non_additive
False
>>> variant.is_harmonised
False
>>> variant.effect_type
EffectType.ADDITIVE

>>> variant_missing_positions = ScoreVariant(**{"rsID": None, "chr_name": None, "chr_position": None, "effect_allele": "A", "effect_weight": 0.5,  "row_nr": 0, "accession": "test"})
Traceback (most recent call last):
...
pydantic_core._pydantic_core.ValidationError: 1 validation error for ScoreVariant
  Value error, Bad position: self.rsID=None, self.chr_name=None, self.chr_position=None...
  ...

>>> harmonised_variant = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "hm_chr": "1", "hm_pos": 1, "hm_rsID": "rs1921", "hm_source": "ENSEMBL",  "row_nr": 0, "accession": "test"})
>>> harmonised_variant.is_harmonised
True

>>> variant_nonadditive = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "dosage_0_weight": 0, "dosage_1_weight": 1,  "dosage_2_weight": 0, "row_nr": 0, "accession": "test"})
>>> variant_nonadditive.is_non_additive
True
>>> variant_nonadditive.is_complex
False
>>> variant_nonadditive.effect_type
EffectType.NONADDITIVE

>>> variant_complex = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "is_haplotype": True,  "row_nr": 0, "accession": "test"})
>>> variant_complex.is_complex
True

The harmonisation process might fail, so variants can be missing mandatory fields.

This must be supported by the model:

>>> bad_hm_variant = ScoreVariant(**{"rsID": "a_weird_rsid", "chr_name": None, "chr_position": None, "effect_allele": "G", "effect_weight": 0.5, "hm_chr": None, "hm_pos": None, "hm_rsID": None, "hm_source": "Unknown",  "row_nr": 0, "accession": "test"})
>>> bad_hm_variant.is_harmonised
True
>>> bad_hm_variant.is_hm_bad
True
>>> harmonised_variant.is_hm_bad
False

rsID format validation (i.e. starts with rs, ss…) is disabled when harmonisation fails:

>>> bad_hm_variant.rsID
'a_weird_rsid'

accession: Annotated[str, Field(title='Accession', description='Accession of score variant')]¶

is_duplicated: Annotated[bool | None, Field(default=False, title='Duplicated variant', description='In a list of variants with the same accession, is ID duplicated?')]¶

model_config¶

output_fields: ClassVar[tuple[str, Ellipsis]] = ('chr_name', 'chr_position', 'effect_allele', 'other_allele', 'effect_weight', 'effect_type',...¶

row_nr: Annotated[int, Field(title='Row number', description='Row number of variant in scoring file (first variant = 0)', ge=0)]¶

class core.lib.models.VariantLog¶

This model consists of variant-level statistics we need to summarise in the ScoreLog

Can’t just reuse ScoreVariants because failed harmonisation can create invalid ScoreVariants (e.g. missing genomic coordinates)

If ScoreLogs are composed of ScoreVariants then the data would be revalidated on instantiation and raise ValidationErrors

Instead, just create VariantLogs from a subset of ScoreVariant fields

hm_source: str | None = None¶

is_complex: bool¶
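The idea of logging only a subset of fields can be sketched with a plain dataclass. This is an illustration, not the pydantic model: `VariantLogSketch` and the `failed_variant` dictionary are hypothetical. The point is that the log keeps no genomic coordinates, so a variant that failed harmonisation can still be summarised.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariantLogSketch:
    """Only the fields needed for summary statistics."""
    is_complex: bool
    hm_source: Optional[str] = None


# Hypothetical dumped variant whose harmonisation failed (no coordinates)
failed_variant = {
    "rsID": "a_weird_rsid",
    "chr_name": None,
    "chr_position": None,
    "hm_source": "Unknown",
    "is_complex": False,
}

# Keep just the subset; the missing coordinates are never revalidated
log = VariantLogSketch(**{key: failed_variant[key] for key in ("hm_source", "is_complex")})
print(log)
```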

class core.lib.models.VariantType¶

Complex alleles are usually haplotypes/diplotypes and the gametic phase must be known to apply them accurately.

See PGS Catalog Curation Guidelines: Appendix A – Special Cases for more information

(Not supported by the calculator.)

APOE_ALLELE = 'APOE_allele'¶

CYP_ALLELE = 'CYP_allele'¶

HLA_AA = 'HLA_AA'¶

HLA_ALLELE = 'HLA_allele'¶

HLA_SEROTYPE = 'HLA_serotype'¶
