Entity Model and Omics Domains

Why Entity Exists

Biofilter 4 uses an entity-centric model so different biological domains can share identity and relationships.

Instead of keeping each source isolated, BF4 stores a common entity layer and links domain records to it. This enables cross-domain queries and reusable knowledge.

Core Entity Objects

At the center of the schema:

  • EntityGroup

    • semantic type bucket (for example: Variants, Genes, Proteins, Diseases)

  • Entity

    • persistent concept record with activity/conflict flags and ETL provenance

  • EntityAlias

    • names/codes/synonyms from multiple systems (alias_type, xref_source)

  • EntityRelationshipType

    • relationship semantics (typed edge meaning)

  • EntityRelationship

    • directed link between two entities with provenance

Practical effect:

  • you can resolve aliases from many sources to one entity identity

  • you can traverse relationships across domains without hardcoded paths

Domain-Specific Master Data

The entity core is complemented by domain tables (master data), such as:

  • genes (GeneMaster and gene-related tables)

  • variants (variant master/effects/GWAS tables)

  • proteins (ProteinMaster, Pfam links)

  • pathways (PathwayMaster)

  • gene ontology (GOMaster, GORelation)

  • diseases (DiseaseMaster)

  • chemicals (ChemicalMaster)

These domain tables provide rich attributes, while entities/aliases/relationships provide integration.

Omics Domains in BF4

Operational Domains (current)

Domains with active schema + ETL/report usage today:

  • Variants

  • Genes

  • Proteins

  • Pathways

  • Gene Ontology

  • Diseases

  • Chemicals

These groups define semantic space and allow gradual expansion without redesigning the core model.

How This Appears in ETL and Reports

  • ETL loads source-specific master/relationship data and writes provenance (ETLPackage).

  • Reports such as entity_filter and entity_relationship_model operate directly on this entity layer.

  • Because identities are persistent, updates can be incremental and still query-consistent across domains.