core.lib.models

PGS Catalog pydantic models for data validation

Best way to reuse:

  • from pgscatalog.core import models and use models.CatalogScoreVariant(**d)

  • import pgscatalog.core and use the fully qualified name: pgscatalog.core.models.CatalogScoreVariant

Classes

Allele

A class that represents an allele found in PGS Catalog scoring files

CatalogScoreHeader

A ScoreHeader that validates the PGS Catalog Scoring File header standard

CatalogScoreVariant

A model representing a row from a PGS Catalog scoring file, defined here: https://www.pgscatalog.org/downloads/#scoring_columns

ScoreFormatVersion

See https://www.pgscatalog.org/downloads/#scoring_changes

ScoreHeader

Headers store useful metadata about a scoring file.

ScoreLog

A log that includes header information and variant summary statistics

ScoreLogs

A container of ScoreLog to simplify serialising to a JSON list

ScoreVariant

This model includes attributes useful for processing and normalising variants

VariantLog

This model consists of variant-level statistics we need to summarise in the ScoreLog

VariantType

Complex alleles are usually haplotypes/diplotypes and the gametic phase must be known to apply them accurately.

Module Contents

class core.lib.models.Allele

A class that represents an allele found in PGS Catalog scoring files

>>> simple_ea = Allele(**{"allele": "A"})
>>> simple_ea
Allele(allele='A', is_snp=True)
>>> str(simple_ea)
'A'
>>> Allele(**{"allele": "AG"})
Allele(allele='AG', is_snp=True)
>>> hla_example = Allele(**{"allele": "+"})
>>> hla_example
Allele(allele='+', is_snp=False)
>>> Allele(allele="A")
Allele(allele='A', is_snp=True)
>>> Allele(allele="A/T").has_multiple_alleles
True
serialize() str

When dumping the model, flatten it to just return the allele as a string

allele: str
property has_multiple_alleles: bool
property is_snp: bool

SNPs are the most common type of effect allele in PGS Catalog scoring files. More complex effect alleles, like HLAs or APOE genes, often require extra work to represent in genomes. Users should be warned about complex effect alleles.
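
The doctest results above ('A' and 'AG' are SNP-like, '+' is not) are consistent with a simple character check. A minimal sketch, assuming an allele counts as SNP-like when every character is one of A, C, G or T. The names SNP_BASES and looks_like_snp are hypothetical and the library's actual rule may differ:

```python
# Illustrative sketch only: not the pgscatalog.core implementation.
SNP_BASES = frozenset("ACGT")  # assumed set of valid bases


def looks_like_snp(allele: str) -> bool:
    """Return True when the allele is non-empty and made only of A, C, G or T."""
    return bool(allele) and set(allele) <= SNP_BASES


for example in ("A", "AG", "+", "APOE_e2"):
    print(example, looks_like_snp(example))
```

A check like this makes it cheap to flag complex effect alleles such as 'APOE_e2' so that users can be warned about them.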

class core.lib.models.CatalogScoreHeader

A ScoreHeader that validates the PGS Catalog Scoring File header standard

https://www.pgscatalog.org/downloads/#dl_ftp_scoring

>>> from ._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000001_hmPOS_GRCh38.txt.gz"
>>> test = CatalogScoreHeader.from_path(testpath)
>>> test
CatalogScoreHeader(pgs_id='PGS000001', pgs_name='PRS77_BC', trait_reported='Breast cancer', genome_build=None, format_version=<ScoreFormatVersion.v2: '2.0'>, trait_mapped=['breast carcinoma'], trait_efo=['EFO_0000305'], variants_number=77, weight_type=None, pgp_id='PGP000001', citation='Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036', HmPOS_build=GenomeBuild.GRCh38, HmPOS_date=datetime.date(2022, 7, 29), HmPOS_match_pos='{"True": null, "False": null}', HmPOS_match_chr='{"True": null, "False": null}')
>>> test.variants_number == test.row_count
True
classmethod check_format_version(version: ScoreFormatVersion) ScoreFormatVersion
classmethod check_pgp_id(pgp_id: str) str
classmethod check_pgs_id(pgs_id: str) str
classmethod parse_genome_build(value: str) pgscatalog.core.lib.genomebuild.GenomeBuild | None
classmethod parse_weight_type(value: str | None) str | None
serialize_genomebuild(genome_build: pgscatalog.core.lib.genomebuild.GenomeBuild | None, _info: pydantic.SerializationInfo) str
classmethod split_traits(trait: str) list[str]
HmPOS_build: Annotated[pgscatalog.core.lib.genomebuild.GenomeBuild | None, Field(default=None)]
HmPOS_date: Annotated[datetime.date | None, Field(default=None)]
HmPOS_match_chr: Annotated[str | None, Field(default=None)]
HmPOS_match_pos: Annotated[str | None, Field(default=None)]
citation: str
format_version: ScoreFormatVersion
property is_harmonised: bool
license: Annotated[str | None, Field('PGS obtained from the Catalog should be cited appropriately, and used in accordance with any licensing restrictions set by the authors. See EBI Terms of Use (https://www.ebi.ac.uk/about/terms-of-use/) for additional details.', repr=False)]
pgp_id: str
trait_efo: Annotated[list[str], Field(description="Ontology trait name, e.g. 'breast carcinoma")]
trait_mapped: Annotated[list[str], Field(description='Trait name')]
variants_number: Annotated[int, Field(gt=0, description='Number of variants listed in the PGS', default=None)]
weight_type: Annotated[str | None, Field(description='Variant weight type', default=None)]
class core.lib.models.CatalogScoreVariant

A model representing a row from a PGS Catalog scoring file, defined here:

https://www.pgscatalog.org/downloads/#scoring_columns

Implementation notes:

  • You should instantiate effect weight fields with strings (e.g. with csv.reader, which returns data as a list of strings)

  • The model always handles effect weights internally as strings and will coerce numeric input to strings when instantiated

  • Our string obsession comes from a desire to faithfully reproduce author-submitted data and avoid introducing precision errors
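
The notes above can be illustrated with the standard library alone. This is a small sketch, using made-up rows rather than a real scoring file, that shows how csv keeps the weight text exactly as written while a round trip through float changes its representation:

```python
import csv
import io

# A made-up, tab-separated scoring file fragment (values are illustrative)
raw = "effect_allele\teffect_weight\nT\t0.1020\nC\t-2.76009e-02\n"

rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))

# csv returns every field as a string, so the author's text is kept verbatim
as_written = [row["effect_weight"] for row in rows]

# Converting to float and back changes how the number is written
round_tripped = [str(float(weight)) for weight in as_written]

print(as_written)
print(round_tripped)
```

Passing the strings from as_written straight to the model avoids the rewriting seen in round_tripped.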

Extra / dynamically named fields:

Only one type of dynamic field is supported. Ancestry-specific allele frequency information uses labels defined by authors.

An example from the first row from PGS000662:

>>> variant_with_allelefrequency = {"chr_name": "1", "chr_position": 5743196, "effect_allele": "T", "other_allele": "C", "effect_weight": 0.102298257, "allelefrequency_effect_European": 0.067, "allelefrequency_effect_African": 0.439, "allelefrequency_effect_Asian": 0.113, "allelefrequency_effect_Hispanic": 0.157}
>>> CatalogScoreVariant(**variant_with_allelefrequency)
CatalogScoreVariant(rsID=None, chr_name='1', chr_position=5743196..., allelefrequency_effect_European=0.067, allelefrequency_effect_African=0.439, allelefrequency_effect_Asian=0.113, allelefrequency_effect_Hispanic=0.157, ...)

An example from the first row from PGS000018 with the edited column name ‘rsid’:

>>> variant_with_rsid_column = {"rsid": "rs2843152", "chr_name": 1, "chr_position": 2245570, "effect_allele": "G", "other_allele": "C", "effect_weight": -2.76009e-02}
>>> CatalogScoreVariant(**variant_with_rsid_column)
CatalogScoreVariant(rsID='rs2843152', chr_name='1', chr_position=2245570..., effect_weight='-0.0276009', ...)

Extra field names which don’t follow the pattern “allelefrequency_effect_{label}” will raise a ValueError:

>>> bad_extra_fields = variant_with_allelefrequency | {"favourite_ice_cream": "vanilla"}
>>> CatalogScoreVariant(**bad_extra_fields)
Traceback (most recent call last):
...
pydantic_core._pydantic_core.ValidationError: 1 validation error for CatalogScoreVariant
  Value error, Invalid extra fields detected: ['favourite_ice_cream'] ...
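
The naming rule for extra fields can be sketched without pydantic. The helper below is hypothetical (it is not part of the library) and assumes that a valid extra field is the prefix "allelefrequency_effect_" followed by a non-empty label:

```python
# Illustrative sketch only: names below are hypothetical, not library API.
ALLOWED_PREFIX = "allelefrequency_effect_"


def find_invalid_extra_fields(extra_fields: dict) -> list[str]:
    """Return extra field names that do not look like allelefrequency_effect_{label}."""
    return [
        name
        for name in extra_fields
        # the label after the prefix must not be empty
        if not (name.startswith(ALLOWED_PREFIX) and len(name) > len(ALLOWED_PREFIX))
    ]


extras = {
    "allelefrequency_effect_European": 0.067,
    "favourite_ice_cream": "vanilla",
}
bad = find_invalid_extra_fields(extras)
if bad:
    print(f"Invalid extra fields detected: {bad}")
```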

Complex alleles are represented a little differently:

>>> complex_allele = {"chr_name": 19, "effect_allele": "APOE_e2", "effect_weight": -0.5, "locus_name": "APOE", "is_haplotype": True, "variant_type": "APOE_allele", "variant_description": None}
>>> CatalogScoreVariant(**complex_allele)
CatalogScoreVariant(rsID=None, chr_name='19', chr_position=None, effect_allele=Allele(allele='APOE_e2', is_snp=False), other_allele=None, locus_name='APOE', is_haplotype=True, is_diplotype=False, imputation_method=None, variant_description=None, inclusion_criteria=None, effect_weight='-0.5', is_interaction=False, is_dominant=False, is_recessive=False, dosage_0_weight=None, dosage_1_weight=None, dosage_2_weight=None, OR=None, HR=None, allelefrequency_effect=None, hm_source=None, hm_rsID=None, hm_chr=None, hm_pos=None, hm_inferOtherAllele=None, hm_match_chr=None, hm_match_pos=None, variant_type=<VariantType.APOE_ALLELE: 'APOE_allele'>, variant_id='19::APOE_e2:', is_harmonised=False, is_complex=True, is_non_additive=False, effect_type=EffectType.ADDITIVE)

Although effect weights are typed as optional, if all effect weight fields are missing then a model validator will raise a validation error:

>>> CatalogScoreVariant(**{"chr_name": "19", "chr_position": 1, "effect_allele": "A", "effect_weight": None})
Traceback (most recent call last):
...
pydantic_core._pydantic_core.ValidationError: 1 validation error for CatalogScoreVariant
  Value error, All effect weight fields are missing ...

effect_weight can be missing if all of the dosage_n_weight (non-additive) fields are present.

However, if any dosage_n_weight field is present then all three must be present:

>>> CatalogScoreVariant(**{"chr_name": "19", "chr_position": 1, "effect_allele": "A", "dosage_0_weight": 0.1, "dosage_1_weight": None, "dosage_2_weight": None})
Traceback (most recent call last):
...
pydantic_core._pydantic_core.ValidationError: 1 validation error for CatalogScoreVariant
  Value error, Dosage missing effect weight ...

A variant may have all effect weight fields. During normalisation the standard effect_weight column will be used:

>>> CatalogScoreVariant(**{"chr_name": "19", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.05, "dosage_0_weight": 0, "dosage_1_weight": 0.1, "dosage_2_weight": 0.3})
CatalogScoreVariant(rsID=None, chr_name='19', ..., is_non_additive=False, ...

Note that is_non_additive is False whenever the effect_weight column is present, even if the non-additive (dosage) fields are also present.
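
The effect weight rules described above can be summarised in one function. This is a rough sketch under the assumption that the rules are exactly as stated in this section. check_weights and DOSAGE_FIELDS are hypothetical names and the library's validators may behave differently in edge cases:

```python
# Illustrative sketch only: not the pgscatalog.core validators.
DOSAGE_FIELDS = ("dosage_0_weight", "dosage_1_weight", "dosage_2_weight")


def check_weights(row: dict) -> bool:
    """Validate the weight fields of a row and return True if it is non-additive."""
    effect_weight = row.get("effect_weight")
    dosages = [row.get(field) for field in DOSAGE_FIELDS]
    # compare against None because 0 is a valid weight
    any_dosage = any(d is not None for d in dosages)
    all_dosages = all(d is not None for d in dosages)

    if effect_weight is None and not any_dosage:
        raise ValueError("All effect weight fields are missing")
    if any_dosage and not all_dosages:
        raise ValueError("Dosage missing effect weight")

    # effect_weight takes priority, so a row is only non-additive when the
    # dosage weights are the sole source of weights
    return effect_weight is None and all_dosages


print(check_weights({"effect_weight": "0.05"}))
print(check_weights({"dosage_0_weight": 0, "dosage_1_weight": 1, "dosage_2_weight": 0}))
```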

classmethod alleles_must_parse(value: Any) Allele
check_complex_variants() CatalogScoreVariant
check_effect_weights() CatalogScoreVariant
check_extra_fields() CatalogScoreVariant

Only allelefrequency_effect_{ancestry} is supported as an extra field. {ancestry} is dynamic and set by submitters.

check_position() CatalogScoreVariant
classmethod effect_weight_must_float(weight: str | None) str | None
classmethod empty_string_to_none(v: Any) Any | None
classmethod set_missing_rsid(rsid: str | None) str | None
HR: Annotated[float | None, Field(default=None, title='Hazard Ratio', description='Author-reported effect sizes can be supplied to the Catalog. If no other effect_weight is given the weight is calculated using the log(OR) or log(HR).')]
OR: Annotated[float | None, Field(default=None, title='Odds Ratio', description='Author-reported effect sizes can be supplied to the Catalog. If no other effect_weight is given the weight is calculated using the log(OR) or log(HR).')]
allelefrequency_effect: Annotated[float | None, Field(default=None, title='Effect Allele Frequency', description='Reported effect allele frequency, if the associated locus is a haplotype then haplotype frequency will be extracted.', ge=0)]
chr_name: Annotated[str | None, Field(default=None, title='Location - Chromosome ', description='Chromosome name/number associated with the variant.', coerce_numbers_to_str=True)]
chr_position: Annotated[int | None, Field(default=None, title='Location within the Chromosome', description='Chromosomal position associated with the variant.', gt=0)]
complex_columns: ClassVar[tuple[str, str, str]] = ('is_haplotype', 'is_diplotype', 'is_interaction')
dosage_0_weight: Annotated[str | None, Field(default=None, title='Effect weight with 0 copy of the effect allele', description='Weights that are specific to different dosages of the effect_allele (e.g. {0, 1, 2} copies) can also be reported when the the contribution of the variants to the score is not encoded as additive, dominant, or recessive. In this case three columns are added corresponding to which variant weight should be applied for each dosage, where the column name is formated as dosage_#_weight where the # sign indicates the number of effect_allele copies.', coerce_numbers_to_str=True)]
dosage_1_weight: Annotated[str | None, Field(default=None, title='Effect weight with 1 copy of the effect allele', description='Weights that are specific to different dosages of the effect_allele (e.g. {0, 1, 2} copies) can also be reported when the the contribution of the variants to the score is not encoded as additive, dominant, or recessive. In this case three columns are added corresponding to which variant weight should be applied for each dosage, where the column name is formated as dosage_#_weight where the # sign indicates the number of effect_allele copies.', coerce_numbers_to_str=True)]
dosage_2_weight: Annotated[str | None, Field(default=None, title='Effect weight with 2 copies of the effect allele', description='Weights that are specific to different dosages of the effect_allele (e.g. {0, 1, 2} copies) can also be reported when the the contribution of the variants to the score is not encoded as additive, dominant, or recessive. In this case three columns are added corresponding to which variant weight should be applied for each dosage, where the column name is formated as dosage_#_weight where the # sign indicates the number of effect_allele copies.', coerce_numbers_to_str=True)]
effect_allele: Annotated[Allele | None, Field(default=None, title='Effect Allele', description="The allele that's dosage is counted (e.g. {0, 1, 2}) and multiplied by the variant's weight (effect_weight) when calculating score. The effect allele is also known as the 'risk allele'. Note: this does not necessarily need to correspond to the minor allele/alternative allele.")]
property effect_type: pgscatalog.core.lib.effecttype.EffectType
effect_weight: Annotated[str | None, Field(default=None, title='Variant Weight', description='Value of the effect that is multiplied by the dosage of the effect allele (effect_allele) when calculating the score. Additional information on how the effect_weight was derived is in the weight_type field of the header, and score development method in the metadata downloads.', coerce_numbers_to_str=True)]
harmonised_columns: ClassVar[tuple[str, str, str, str]] = ('hm_source', 'hm_rsID', 'hm_chr', 'hm_pos')
hm_chr: Annotated[str | None, Field(default=None, title='Harmonized chromosome name', description='Chromosome that the harmonized variant is present on, preferring matches to chromosomes over patches present in later builds.')]
hm_inferOtherAllele: Annotated[Allele | None, Field(default=None, title='Harmonized other alleles', description='If only the effect_allele is given we attempt to infer the non-effect/other allele(s) using Ensembl/dbSNP alleles.')]
hm_match_chr: Annotated[bool | None, Field(default=None, title='FLAG: matching chromosome name', description='Used for QC. Only provided if the scoring file is being harmonized to the same genome build, and where the chromosome name is provided in the column chr_name.')]
hm_match_pos: Annotated[bool | None, Field(default=None, title='FLAG: matching chromosome position', description='Used for QC. Only provided if the scoring file is being harmonized to the same genome build, and where the chromosome name is provided in the column chr_position.')]
hm_pos: Annotated[int | None, Field(ge=0, default=None, title='Harmonized chromosome position', description='Chromosomal position (base pair location) where the variant is located, preferring matches to chromosomes over patches present in later builds.')]
hm_rsID: Annotated[str | None, Field(default=None, title='Harmonized rsID', description='Current rsID. Differences between this column and the author-reported column (rsID) indicate variant merges and annotation updates from dbSNP.')]
hm_source: Annotated[str | None, Field(default=None, title='Provider of the harmonized variant information', description='Data source of the variant position. Options include: ENSEMBL, liftover, author-reported (if being harmonized to the same build).')]
imputation_method: Annotated[str | None, Field(default=None, title='Imputation Method', description='This described whether the variant was specifically called with a specific imputation or variant calling method. This is mostly kept to describe HLA-genotyping methods (e.g. flag SNP2HLA, HLA*IMP) that gives alleles that are not referenced by genomic position.')]
inclusion_criteria: Annotated[str | None, Field(default=None, title='Score Inclusion Criteria', description='Explanation of when this variant gets included into the PGS (e.g. if it depends on the results from other variants).')]
property is_complex: bool
is_diplotype: Annotated[bool | None, Field(default=False, title='FLAG: Diplotype', description='This is a TRUE/FALSE variable that flags whether the effect allele is a haplotype/diplotype rather than a single SNP. Constituent SNPs in the haplotype are semi-colon separated.')]
is_dominant: Annotated[bool | None, Field(default=False, title='FLAG: Dominant Inheritance Model', description='This is a TRUE/FALSE variable that flags whether the weight should be added to the PGS sum if there is at least 1 copy of the effect allele (e.g. it is a dominant allele).')]
is_haplotype: Annotated[bool | None, Field(default=False, title='FLAG: Haplotype', description='This is a TRUE/FALSE variable that flags whether the effect allele is a haplotype/diplotype rather than a single SNP. Constituent SNPs in the haplotype are semi-colon separated.')]
property is_harmonised: bool
property is_hm_bad: bool

Was harmonisation OK?

is_interaction: Annotated[bool | None, Field(default=False, title='FLAG: Interaction', description='This is a TRUE/FALSE variable that flags whether the weight should be multiplied with the dosage of more than one variant. Interactions are demarcated with a _x_ between entries for each of the variants present in the interaction.')]
property is_non_additive: bool
is_recessive: Annotated[bool | None, Field(default=False, title='FLAG: Recessive Inheritance Model', description='This is a TRUE/FALSE variable that flags whether the weight should be added to the PGS sum only if there are 2 copies of the effect allele (e.g. it is a recessive allele).')]
locus_name: Annotated[str | None, Field(default=None, title='Locus Name', description='This is kept in for loci where the variant may be referenced by the gene (APOE e4). It is also common (usually in smaller PGS) to see the variants named according to the genes they impact.')]
model_config
non_additive_columns: ClassVar[tuple[str, str, str]] = ('dosage_0_weight', 'dosage_1_weight', 'dosage_2_weight')
other_allele: Annotated[Allele | None, Field(default=None, title='Other allele(s)', description='The other allele(s) at the loci. Note: this does not necessarily need to correspond to the reference allele.')]
rsID: Annotated[str | None, Field(default=None, validation_alias=AliasChoices('rsID', 'rsid'), title='dbSNP Accession ID (rsID)', description='The SNP’s rsID. This column also contains HLA alleles in the standard notation (e.g. HLA-DQA1*0102) that aren’t always provided with chromosomal positions.')]
variant_description: Annotated[str | None, Field(default=None, title='Variant Description', description='This field describes any extra information about the variant (e.g. how it is genotyped or scored) that cannot be captured by the other fields.')]
property variant_id: str

ID = chr:pos:effect_allele:other_allele
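
The complex allele example earlier on this page shows variant_id='19::APOE_e2:', which suggests that missing components are written as empty strings. A minimal sketch under that assumption (make_variant_id is a hypothetical helper, not library API):

```python
def make_variant_id(chr_name, chr_position, effect_allele, other_allele) -> str:
    """Join the four components with ':'; missing values become empty strings."""
    parts = (chr_name, chr_position, effect_allele, other_allele)
    return ":".join("" if part is None else str(part) for part in parts)


print(make_variant_id("19", None, "APOE_e2", None))
print(make_variant_id("1", 5743196, "T", "C"))
```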

variant_type: Annotated[VariantType | None, Field(default=None, title='Complex alleles only: how is the variant name formatted?')]
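
The field descriptions for effect_weight, is_dominant and is_recessive above describe three ways a weight can contribute to a score. The sketch below only restates those descriptions as code. weight_contribution is a hypothetical function, the weight is converted to float for arithmetic, and this is not how the calculator itself applies weights:

```python
def weight_contribution(dosage: int, weight: str, *, is_dominant: bool = False,
                        is_recessive: bool = False) -> float:
    """Contribution of one variant for an effect allele dosage of 0, 1 or 2."""
    value = float(weight)  # weights are stored as strings, so convert for arithmetic
    if is_dominant:
        # added once if there is at least one copy of the effect allele
        return value if dosage >= 1 else 0.0
    if is_recessive:
        # added once only if there are two copies of the effect allele
        return value if dosage == 2 else 0.0
    # additive: the weight is multiplied by the dosage
    return dosage * value


for dosage in (0, 1, 2):
    print(dosage, weight_contribution(dosage, "0.5"),
          weight_contribution(dosage, "0.5", is_dominant=True),
          weight_contribution(dosage, "0.5", is_recessive=True))
```
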
class core.lib.models.ScoreFormatVersion

See https://www.pgscatalog.org/downloads/#scoring_changes

v1 was deprecated in December 2021.

v2 = '2.0'
class core.lib.models.ScoreHeader

Headers store useful metadata about a scoring file.

Data validation is less strict than the CatalogScoreHeader, to make it easier for people to use custom scoring files with the PGS Catalog Calculator.

>>> ScoreHeader(**{"pgs_id": "PGS123456", "trait_reported": "testtrait", "genome_build": "GRCh38"})
ScoreHeader(pgs_id='PGS123456', pgs_name=None, trait_reported='testtrait', genome_build=GenomeBuild.GRCh38)
>>> ScoreHeader(**{"omicspred_id": "OPGS123456", "trait_reported": "testtrait", "genome_build": "GRCh38"})
ScoreHeader(pgs_id='OPGS123456', pgs_name=None, trait_reported='testtrait', genome_build=GenomeBuild.GRCh38)
>>> ScoreHeader(**{"score_id": "SC1234B", "trait_reported": "testtrait", "genome_build": "GRCh37"})
ScoreHeader(pgs_id='SC1234B', pgs_name=None, trait_reported='testtrait', genome_build=GenomeBuild.GRCh37)
>>> from ._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "PGS000001_hmPOS_GRCh38.txt.gz"
>>> ScoreHeader.from_path(testpath).row_count
77
>>> from ._config import Config
>>> testpath = Config.ROOT_DIR / "tests" / "data" / "OPGS002493.txt.gz"
>>> test = ScoreHeader.from_path(testpath) # doctest
classmethod from_path(path: str | pathlib.Path) Self
classmethod parse_genome_build(value: str) pgscatalog.core.lib.genomebuild.GenomeBuild | None
serialize_genomebuild(genome_build: pgscatalog.core.lib.genomebuild.GenomeBuild, _info: pydantic.SerializationInfo) str
genome_build: Annotated[pgscatalog.core.lib.genomebuild.GenomeBuild | None, Field(description='Genome build')]
property is_harmonised: bool
pgs_id: Annotated[str | None, Field(title='PGS identifier', validation_alias=AliasChoices('pgs_id', 'omicspred_id', 'score_id'))]
pgs_name: Annotated[str | None, Field(description='PGS name', default=None)]
property row_count: int

Calculate the number of variants in the scoring file by counting the number of rows
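
Counting rows can be sketched with the standard library. This example assumes the usual scoring file layout (metadata lines starting with '#', then one line of column names, then one line per variant). count_variant_rows and the file contents are made up for illustration and are not the library's implementation:

```python
import gzip
import os
import tempfile


def count_variant_rows(path: str) -> int:
    """Count data rows, skipping '#' header lines, blank lines and the column-name row."""
    with gzip.open(path, "rt") as handle:
        data_lines = [line for line in handle if line.strip() and not line.startswith("#")]
    # the first non-comment line is assumed to hold the column names
    return max(len(data_lines) - 1, 0)


# Build a tiny made-up scoring file to try the function on
content = (
    "#pgs_id=PGS000000\n"
    "#variants_number=2\n"
    "rsID\teffect_allele\teffect_weight\n"
    "rs1\tA\t0.1\n"
    "rs2\tG\t-0.2\n"
)
with tempfile.TemporaryDirectory() as tmpdir:
    example_path = os.path.join(tmpdir, "example.txt.gz")
    with gzip.open(example_path, "wt") as handle:
        handle.write(content)
    n_rows = count_variant_rows(example_path)

print(n_rows)
```

Comparing a count like this with variants_number from the header is a quick consistency check, as in the CatalogScoreHeader doctest above.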

trait_reported: Annotated[str, Field(description='Trait name')]
class core.lib.models.ScoreLog

A log that includes header information and variant summary statistics

>>> header = CatalogScoreHeader(pgs_id='PGS000001', pgs_name='PRS77_BC', trait_reported='Breast cancer', genome_build=None, format_version=ScoreFormatVersion.v2, trait_mapped='breast carcinoma', trait_efo='EFO_0000305', variants_number=77, weight_type="NR", pgp_id='PGP000001', citation='Mavaddat N et al. J Natl Cancer Inst (2015). doi:10.1093/jnci/djv036', HmPOS_build="GRCh38", HmPOS_date="2022-07-29")
>>> harmonised_variant = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "HLA-DQ", "effect_weight": 0.5, "hm_chr": "1", "hm_pos": 1, "hm_rsID": "rs1921", "hm_source": "ENSEMBL",  "row_nr": 0, "accession": "test"})
>>> variant_log = harmonised_variant.model_dump(include={"hm_source", "is_complex"})
>>> scorelog = ScoreLog(header=header, compatible_effect_type=True, variant_logs=[VariantLog(**variant_log)])
>>> scorelog
ScoreLog(header=CatalogScoreHeader(...), compatible_effect_type=True, has_complex_alleles=True, pgs_id='PGS000001', is_harmonised=True, sources=['ENSEMBL'])

In the original scoring file header there were 77 variants:

>>> scorelog.header.variants_number
77

But we’ve only got 1 ScoreVariant:

>>> scorelog.n_actual_variants
1
>>> scorelog.variant_count_difference
76
>>> scorelog.variants_are_missing
True

Maybe they were all filtered out during normalisation. It’s important to log and warn when this happens.
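
The arithmetic behind the doctest above is a simple subtraction. A minimal sketch, assuming the difference is the header count minus the number of processed variants and is undefined when either count is unknown (the function name mirrors the property but is a standalone, hypothetical helper):

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def variant_count_difference(variants_number: Optional[int],
                             n_actual: Optional[int]) -> Optional[int]:
    """Header count minus processed count; None when either count is unknown."""
    if variants_number is None or n_actual is None:
        return None
    return variants_number - n_actual


difference = variant_count_difference(77, 1)
if difference:
    # log a warning so that dropped variants do not go unnoticed
    logger.warning("%d variants listed in the header are missing", difference)
```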

>>> scorelog.sources
['ENSEMBL']
>>> scorelog.model_dump()
{'header': {'pgs_id': 'PGS000001', ...}, 'compatible_effect_type': True, 'has_complex_alleles': True, 'pgs_id': 'PGS000001', 'is_harmonised': True, 'sources': ['ENSEMBL']}
compatible_effect_type: bool
property has_complex_alleles: bool

Do any variants contain complex alleles? e.g. HLA/APOE

header: ScoreHeader | CatalogScoreHeader
property is_harmonised: bool
model_config
property n_actual_variants: int | None
property pgs_id: str | None
property sources: list[str] | None
property variant_count_difference: int | None
variant_logs: list[VariantLog] | None
property variants_are_missing: bool
class core.lib.models.ScoreLogs

A container of ScoreLog to simplify serialising to a JSON list

root: list[ScoreLog]
class core.lib.models.ScoreVariant

This model includes attributes useful for processing and normalising variants

>>> variant = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "row_nr": 0, "accession": "test"})
>>> variant
ScoreVariant(rsID=None, chr_name='1', chr_position=1, effect_allele=Allele(allele='A', ...
>>> variant.is_complex
False
>>> variant.is_non_additive
False
>>> variant.is_harmonised
False
>>> variant.effect_type
EffectType.ADDITIVE
>>> variant_missing_positions = ScoreVariant(**{"rsID": None, "chr_name": None, "chr_position": None, "effect_allele": "A", "effect_weight": 0.5,  "row_nr": 0, "accession": "test"})
Traceback (most recent call last):
...
pydantic_core._pydantic_core.ValidationError: 1 validation error for ScoreVariant
  Value error, Bad position: self.rsID=None, self.chr_name=None, self.chr_position=None...
  ...
>>> harmonised_variant = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "hm_chr": "1", "hm_pos": 1, "hm_rsID": "rs1921", "hm_source": "ENSEMBL",  "row_nr": 0, "accession": "test"})
>>> harmonised_variant.is_harmonised
True
>>> variant_nonadditive = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "dosage_0_weight": 0, "dosage_1_weight": 1,  "dosage_2_weight": 0, "row_nr": 0, "accession": "test"})
>>> variant_nonadditive.is_non_additive
True
>>> variant_nonadditive.is_complex
False
>>> variant_nonadditive.effect_type
EffectType.NONADDITIVE
>>> variant_complex = ScoreVariant(**{"rsID": None, "chr_name": "1", "chr_position": 1, "effect_allele": "A", "effect_weight": 0.5, "is_haplotype": True,  "row_nr": 0, "accession": "test"})
>>> variant_complex.is_complex
True

The harmonisation process might fail, so variants can be missing mandatory fields.

This must be supported by the model:

>>> bad_hm_variant = ScoreVariant(**{"rsID": "a_weird_rsid", "chr_name": None, "chr_position": None, "effect_allele": "G", "effect_weight": 0.5, "hm_chr": None, "hm_pos": None, "hm_rsID": None, "hm_source": "Unknown",  "row_nr": 0, "accession": "test"})
>>> bad_hm_variant.is_harmonised
True
>>> bad_hm_variant.is_hm_bad
True
>>> harmonised_variant.is_hm_bad
False

rsID format validation (i.e. starts with rs, ss…) is disabled when harmonisation fails:

>>> bad_hm_variant.rsID
'a_weird_rsid'
accession: Annotated[str, Field(title='Accession', description='Accession of score variant')]
is_duplicated: Annotated[bool | None, Field(default=False, title='Duplicated variant', description='In a list of variants with the same accession, is ID duplicated?')]
model_config
output_fields: ClassVar[tuple[str, Ellipsis]] = ('chr_name', 'chr_position', 'effect_allele', 'other_allele', 'effect_weight', 'effect_type',...
row_nr: Annotated[int, Field(title='Row number', description='Row number of variant in scoring file (first variant = 0)', ge=0)]
class core.lib.models.VariantLog

This model consists of variant-level statistics we need to summarise in the ScoreLog

ScoreVariants can’t simply be reused here, because failed harmonisation can create invalid ScoreVariants (e.g. missing genomic coordinates).

If ScoreLogs were composed of ScoreVariants, the data would be revalidated on instantiation and raise ValidationErrors.

Instead, VariantLogs are created from a subset of ScoreVariant fields.

hm_source: str | None = None
is_complex: bool
class core.lib.models.VariantType

Complex alleles are usually haplotypes/diplotypes and the gametic phase must be known to apply them accurately.

See PGS Catalog Curation Guidelines: Appendix A – Special Cases for more information

(Not supported by the calculator.)

APOE_ALLELE = 'APOE_allele'
CYP_ALLELE = 'CYP_allele'
HLA_AA = 'HLA_AA'
HLA_ALLELE = 'HLA_allele'
HLA_SEROTYPE = 'HLA_serotype'