Database Schema¶
Entity-relationship diagram of the Biofilter 4 database, derived directly from the SQLAlchemy ORM models.
The schema is organized into four logical zones:
ETL infrastructure — source systems, data sources, and execution packages
Core entity layer — canonical entities, aliases, relationships, and genomic locations
Domain master tables — gene, protein, pathway, disease, chemical, GO
Variant tables — partitioned tables for large-scale variant data (gnomAD, VEP, GWAS)
ER Diagram¶
erDiagram
%% ── ETL INFRASTRUCTURE ──────────────────────────────────────────────────
etl_source_systems {
int id PK
string name
bool active
}
etl_data_sources {
int id PK
string name
int source_system_id FK
string data_type
string dtp_script
bool active
}
etl_packages {
int id PK
int data_source_id FK
string status
string extract_status
string transform_status
string load_status
}
%% ── CORE ENTITY LAYER ───────────────────────────────────────────────────
entity_groups {
int id PK
string name
}
entities {
bigint id PK
int group_id FK
int data_source_id FK
int etl_package_id FK
bool is_active
bool has_conflict
}
entity_aliases {
bigint id PK
bigint entity_id FK
int group_id FK
string alias_value
string alias_type
string xref_source
bool is_primary
}
entity_relationship_types {
int id PK
string code
string description
}
entity_relationships {
bigint id PK
bigint entity_1_id FK
int entity_1_group_id FK
bigint entity_2_id FK
int entity_2_group_id FK
int relationship_type_id FK
int data_source_id FK
}
entity_locations {
bigint id PK
bigint entity_id FK
int assembly_id FK
int chromosome
bigint start_pos
bigint end_pos
string strand
}
%% ── REFERENCE TABLES ────────────────────────────────────────────────────
genome_assemblies {
int id PK
string accession
string assembly_name
string chromosome
}
omic_status {
int id PK
string name
}
%% ── GENE DOMAIN ─────────────────────────────────────────────────────────
gene_locus_groups {
int id PK
string name
}
gene_locus_types {
int id PK
string name
}
gene_masters {
int id PK
bigint entity_id FK
string symbol
int locus_group_id FK
int locus_type_id FK
int omic_status_id FK
}
gene_groups {
int id PK
string name
}
gene_group_memberships {
int gene_id FK
int group_id FK
}
%% ── PROTEIN DOMAIN ──────────────────────────────────────────────────────
protein_masters {
int id PK
string protein_id
}
protein_entities {
int id PK
bigint entity_id FK
int protein_id FK
bool is_isoform
}
protein_pfams {
int id PK
string pfam_acc
string pfam_id
string type
}
protein_pfam_links {
int protein_id FK
int pfam_pk_id FK
}
%% ── PATHWAY DOMAIN ──────────────────────────────────────────────────────
pathway_masters {
int id PK
bigint entity_id FK
string pathway_id
string description
}
%% ── DISEASE DOMAIN ──────────────────────────────────────────────────────
disease_masters {
int id PK
bigint entity_id FK
string disease_id
int omic_status_id FK
}
disease_groups {
int id PK
string name
}
disease_group_memberships {
int id PK
int disease_id FK
int group_id FK
}
%% ── CHEMICAL DOMAIN ─────────────────────────────────────────────────────
chemical_masters {
int id PK
bigint entity_id FK
string chemical_id
int omic_status_id FK
string formula
float mass
}
%% ── GENE ONTOLOGY DOMAIN ────────────────────────────────────────────────
go_masters {
int id PK
bigint entity_id FK
string go_id
string name
string namespace
}
go_relations {
int id PK
int parent_id FK
int child_id FK
string relation_type
}
%% ── VARIANT DOMAIN (partitioned by chromosome) ──────────────────────────
variant_masters {
bigint variant_id PK
int chromosome PK
bigint position_start
bigint position_end
string reference_allele
string alternate_allele
string rsid
float af
}
variant_molecular_effects {
bigint variant_id FK
int chromosome
string transcript_id
int consequence_id FK
int impact_id FK
bool is_most_severe_for_variant
string hgvsc
string hgvsp
}
variant_effect_predictions {
bigint variant_id FK
int chromosome
string predictor_key
string predictor_name
float score
string classification
}
variant_consequence_groups {
int id PK
string name
}
variant_consequence_categories {
int id PK
string name
}
variant_consequences {
int id PK
string name
int consequence_group_id FK
int consequence_category_id FK
int severity_rank
}
variant_impacts {
int id PK
string name
int severity_rank
}
variant_gwas {
bigint id PK
string snp_id
string raw_trait
string mapped_trait
float p_value
}
variant_gwas_snp {
bigint id PK
bigint variant_gwas_id FK
bigint snp_id
}
variant_snp_merges {
bigint rs_obsolete_id PK
bigint rs_canonical_id PK
}
%% ── RELATIONSHIPS ───────────────────────────────────────────────────────
%% ETL
etl_source_systems ||--o{ etl_data_sources : "has"
etl_data_sources ||--o{ etl_packages : "tracks"
%% Entity core
entity_groups ||--o{ entities : "classifies"
entities ||--o{ entity_aliases : "has"
entity_groups ||--o{ entity_aliases : "scopes"
entities ||--o{ entity_relationships : "as entity_1"
entities ||--o{ entity_relationships : "as entity_2"
entity_relationship_types ||--o{ entity_relationships : "typed by"
entities ||--o{ entity_locations : "located at"
genome_assemblies ||--o{ entity_locations : "assembly"
%% Gene
entities ||--o{ gene_masters : "gene"
gene_locus_groups ||--o{ gene_masters : "locus group"
gene_locus_types ||--o{ gene_masters : "locus type"
omic_status ||--o{ gene_masters : "status"
gene_masters ||--o{ gene_group_memberships : "belongs to"
gene_groups ||--o{ gene_group_memberships : "has"
%% Protein
entities ||--o{ protein_entities : "entity link"
protein_masters ||--o{ protein_entities : "protein"
protein_masters ||--o{ protein_pfam_links : "has domain"
protein_pfams ||--o{ protein_pfam_links : "domain"
%% Pathway / Disease / Chemical / GO
entities ||--o{ pathway_masters : "pathway"
entities ||--o{ disease_masters : "disease"
omic_status ||--o{ disease_masters : "status"
disease_masters ||--o{ disease_group_memberships : "belongs to"
disease_groups ||--o{ disease_group_memberships : "has"
entities ||--o{ chemical_masters : "chemical"
omic_status ||--o{ chemical_masters : "status"
entities ||--o{ go_masters : "GO term"
go_masters ||--o{ go_relations : "parent"
go_masters ||--o{ go_relations : "child"
%% Variants
variant_masters ||--o{ variant_molecular_effects : "effects"
variant_consequence_groups ||--o{ variant_consequences : "group"
variant_consequence_categories ||--o{ variant_consequences : "category"
variant_masters ||--o{ variant_effect_predictions : "predictions"
variant_gwas ||--o{ variant_gwas_snp : "indexed SNP"
Key design notes¶
Entity as the universal anchor — every domain master table (
gene_masters,disease_masters,pathway_masters,chemical_masters,go_masters) links back to a single row inentities. This enables cross-domain relationship queries throughentity_relationshipswithout domain-specific join logic.entity_aliasesas the name registry — all names, symbols, synonyms, and external codes (HGNC, Ensembl, OMIM, ICD10, MONDO, ChEBI) are stored here.alias_typeispreferred,synonym, orcode;xref_sourceidentifies the originating system.entity_relationshipsas the graph surface — connects any two entities with a typed, directed edge. Supports multi-hop traversal (e.g., gene → pathway → disease) entirely within SQL.Variant tables are partitioned —
variant_masters,variant_molecular_effects,variant_effect_predictions,variant_regulatory_elements, andvariant_gene_regulatory_evidenceare partitioned bychromosomeon PostgreSQL (plain tables on SQLite). No physical FK constraints — integrity is ETL-enforced for performance at scale.Provenance on every row —
data_source_idandetl_package_idare present on virtually every table, enabling full traceability back to the source system and the exact ETL execution that produced each record.etl_packagesas the audit log — each ETL run produces oneETLPackagerow per DataSource, tracking extract/transform/load status, row counts, file hashes, and timing.