Relationship Confidence Scores

From London Book Trades

How family relationships are established, what can go wrong, and how to interpret the confidence scores you see on each person's page.


The Source Data

The family relationships in this database — who was married to whom, who was the child of whom — were recorded over several decades in a collection of legacy tables that were never designed with strict data integrity in mind. The people who built those tables were historians and bibliographers, not database engineers. Their priority was capturing as much information as possible; the exact form that information took was secondary.

Family relationships are drawn from three legacy tables, each with a different level of reliability:

Table | What it contains | Priority
Parents table | Each person's record includes their father's and mother's identity numbers and names. This is the primary source for parent–child relationships. | 1 (highest)
Partnerships table | Records of marriages and partnerships, linking two people by identity number with an optional marriage date. | 2
Children table | Lists of children associated with each partnership. Used to fill gaps left by the parents table, not to override it. | 3 (lowest)

When the same relationship appears in more than one table, the parents table takes precedence. The children table is used only to add a missing parent when the parents table has left one blank — if both parents are already established from the parents table, the children table entry is ignored.
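
The precedence rule can be sketched in a few lines of Python. This is an illustration only, not the migration code: the function name `merge_parents` and the dictionary layout (a role mapped to an identity number, or `None` for a blank) are assumptions made for the example.

```python
from typing import Dict, Optional

ROLES = ("father", "mother")


def merge_parents(
    from_parents_table: Dict[str, Optional[int]],
    from_children_table: Dict[str, Optional[int]],
) -> Dict[str, Optional[int]]:
    """Combine parent identity numbers, letting the parents table win.

    Each argument maps "father" and "mother" to an identity number,
    or to None when that table leaves the parent blank.
    """
    merged: Dict[str, Optional[int]] = {}
    for role in ROLES:
        primary = from_parents_table.get(role)
        if primary is not None:
            # The parents table has a value: the children table is ignored.
            merged[role] = primary
        else:
            # The parents table is blank: the children table may fill the gap.
            merged[role] = from_children_table.get(role)
    return merged
```

For instance, `merge_parents({"father": 101, "mother": None}, {"father": 999, "mother": 202})` keeps father 101 from the parents table and takes mother 202 from the children table.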


Known Data Problems

The source tables contain a number of systematic problems that affect how confidently a relationship can be established. These are not errors introduced by our processing — they reflect genuine ambiguities and inconsistencies in the historical records themselves.

Wrong identity numbers

A parent's identity number in the parents table sometimes points to a different person than the name recorded alongside it. This appears to be a transcription error: a number was copied incorrectly while the name was written correctly, or vice versa. Where the names conflict with the identity record, we attempt to find the correct person by searching the name index.

Maiden names versus married names

Women appear under different surnames at different points in their lives — their maiden name, their married name, sometimes a subsequent married name after widowhood or remarriage. The parents table may record a mother under her married name while the identity table uses her maiden name, or the reverse. We cross-reference against the name index, which records all known surnames for each person, to resolve these cases.

Spelling variants

Names were not standardised in the sixteenth, seventeenth, and eighteenth centuries. The same person might appear as Vautroller in one record and Vautrollier in another; Purfoote and Purfoot; Brown and Browne. Where a name in the parents table differs slightly from the identity record, we check whether the person's known name variants include the spelling used.
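
A variant check of this kind might look like the sketch below. The function name `matches_known_variant` is hypothetical, and ignoring letter case and stray whitespace is an assumption; this page does not say how strictly spellings are compared.

```python
from typing import Iterable


def _normalise(name: str) -> str:
    # Assumption: case and surrounding/repeated whitespace are not significant.
    return " ".join(name.split()).casefold()


def matches_known_variant(recorded: str, known_variants: Iterable[str]) -> bool:
    """Return True if the recorded spelling is one of the person's known variants."""
    return _normalise(recorded) in {_normalise(v) for v in known_variants}
```

With the variants `{"Vautroller", "Vautrollier"}` on record, either spelling is accepted, while a spelling that is not listed (for example `"Browne"` against `{"Brown"}`) is not.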

Children assigned to multiple partnerships

A known limitation of the original data design is that every child had to be associated with a partnership record — there was no way to record a child with only one known parent. The workaround was to associate a child with every partnership involving their known parent, even when the other partner was unknown or different. This created spurious duplicate entries. We detect and ignore these redundant additions.
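
Detecting these cases amounts to finding children listed under more than one partnership. The sketch below shows that detection step only, assuming the children table can be read as `(partnership_id, child_id)` pairs; the function name is hypothetical, and how the migration decides which entry to keep is not described on this page.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def children_in_multiple_partnerships(
    rows: Iterable[Tuple[int, int]],
) -> Dict[int, List[int]]:
    """Map each child listed under several partnerships to those partnerships.

    `rows` is an iterable of (partnership_id, child_id) pairs.
    """
    partnerships_by_child = defaultdict(set)
    for partnership_id, child_id in rows:
        partnerships_by_child[child_id].add(partnership_id)
    # Keep only the children that appear under more than one partnership.
    return {
        child: sorted(partnerships)
        for child, partnerships in partnerships_by_child.items()
        if len(partnerships) > 1
    }
```

Given the rows `[(1, 10), (2, 10), (1, 11)]`, child 10 is reported as belonging to partnerships 1 and 2, and child 11 is not reported.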

Sex mismatches

In a small number of cases, the person recorded as a father is identified as female in the identity table, or the person recorded as a mother is identified as male. These are flagged as anomalies. Where a mismatch is detected, the confidence score for that relationship is reduced by 15 points.

Notice: All anomalies detected during processing are logged in a warnings report. These represent genuine data quality issues that may benefit from manual review and correction in the source tables.


How Relationships Are Established

For each relationship, the following steps are attempted in order, stopping as soon as a confident resolution is found.

Step 1 — Direct match

The identity number given in the source table is looked up. If the name recorded alongside it matches the identity record exactly, the relationship is accepted with full confidence.

Step 2 — Name index confirmation

If the names do not match exactly, we consult the name index, which records every known surname for each person (maiden name, married name, former name, alias, and so on). If any of the person's known surnames matches what the source record says, the relationship is accepted. The type of match (née, subs., form., alias) is recorded in the explanation.

Step 3 — Spouse confirmation

For mothers specifically: if the identity number points to someone confirmed as the spouse of the father in the partnerships table, the discrepancy is treated as a maiden-name / married-name difference and the relationship is accepted.

Step 4 — Name search

If none of the above succeed, the name given in the source record is searched across all identity records. If exactly one person matches — and that person's sex is consistent with the role (father or mother) — that person is substituted for the original identity number, and the relationship is recorded with a note that a substitution was made.

Step 5 — Best available

If no unique match is found, the original identity number is retained as the best available option, with a note that the contradiction could not be resolved. This results in the lowest confidence scores.

Every relationship record carries a full explanation of which step was reached and what evidence was found.
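
The five steps can be read as a cascade that returns as soon as one step succeeds. The sketch below is a simplified illustration, not the migration code. The `Person` class, the `resolve_parent` function, and the `father_spouse_ids` parameter are assumptions made for the example, and step 2 here compares surnames only.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional, Set, Tuple


@dataclass
class Person:
    id: int
    forename: str
    surname: str
    sex: str  # "M" or "F"
    known_surnames: Set[str] = field(default_factory=set)  # name index entries


def resolve_parent(
    ref_id: Optional[int],
    forename: str,
    surname: str,
    role: str,  # "father" or "mother"
    people: Dict[int, Person],
    father_spouse_ids: FrozenSet[int] = frozenset(),
) -> Tuple[Optional[int], str]:
    """Return (identity number, step reached) for one parent reference."""
    person = people.get(ref_id) if ref_id is not None else None

    if person is not None:
        # Step 1: the recorded name matches the identity record exactly.
        if (person.forename, person.surname) == (forename, surname):
            return person.id, "direct match"
        # Step 2: the recorded surname is one of the person's known surnames.
        if surname in person.known_surnames:
            return person.id, "name index"
        # Step 3 (mothers only): the person is a confirmed spouse of the father.
        if role == "mother" and person.id in father_spouse_ids:
            return person.id, "spouse confirmation"

    # Step 4: search every identity record for the recorded name.
    expected_sex = "M" if role == "father" else "F"
    matches = [
        p for p in people.values()
        if p.forename == forename
        and (p.surname == surname or surname in p.known_surnames)
    ]
    if len(matches) == 1 and matches[0].sex == expected_sex:
        return matches[0].id, "name search substitution"

    # Step 5: keep the original identity number as the best available option.
    return ref_id, "unresolved"
```

The returned label corresponds to the step reached, which is the kind of information the explanation on each relationship record reports.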


Confidence Scores

Every relationship is assigned a confidence score from 0 to 100 based on how it was established. Higher scores mean stronger evidence; lower scores indicate that the relationship rests on inference or unresolved contradictions.

Base scores

Score | How the relationship was established
100 | Identity number and name match the source record exactly. No ambiguity.
95 | Marriage recorded directly in the partnerships table, linking two identity numbers. No name resolution required.
90 | Confirmed via the name index: the surname in the source record matches a known alternative name (maiden, married, former, or alias) for this person.
85 | Child identified by a direct identity number in the children table; or name index confirmed but forenames differ slightly between the source record and the index entry.
80 | Mother confirmed as the spouse of the father in the partnerships table; name difference attributed to maiden versus married name.
75 | Child identified via a cross-reference in the old number field of the identity table.
70 | Child identified via a cross-reference table.
65 | Name search found a unique match and the original identity number was replaced with the correct person. A substitution was made.
60 | Child has no identity number in the source data; a placeholder record was created from the name alone. The person's identity is uncertain.
40 | A contradiction was detected but could not be resolved. The original identity number was retained as the best available option.

Penalties

The following problems, when detected, reduce the base score:

Deduction | Reason
−15 | The person recorded as father is identified as female in the identity table, or the person recorded as mother is identified as male. The relationship is retained but the sex conflict is noted.
−10 | The identity number given in the source record does not appear in the identity table at all. The relationship was established by name search alone.
−5 | The name index confirmed the surname but the forenames differ slightly between the source record and the index entry.

Scores never fall below 10. Even the most uncertain relationships are based on some evidence from the historical record.
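
The arithmetic can be summarised as base score minus deductions, with a floor of 10. The sketch below uses the published values, but the dictionary keys and the function name `confidence_score` are labels invented for the example. It also assumes that the 85 for "name index confirmed but forenames differ" is the 90 base with the −5 deduction applied, which this page does not state explicitly.

```python
from typing import Iterable

# Values from the base-score table; the keys are hypothetical labels.
BASE_SCORES = {
    "direct_match": 100,
    "partnership_record": 95,
    "name_index": 90,
    "child_direct_id": 85,
    "spouse_confirmation": 80,
    "child_old_number": 75,
    "child_cross_reference": 70,
    "name_search_substitution": 65,
    "child_placeholder": 60,
    "unresolved": 40,
}

# Values from the penalties table; the keys are hypothetical labels.
PENALTIES = {
    "sex_mismatch": 15,
    "id_not_in_identity_table": 10,
    "forename_differs": 5,
}

MIN_SCORE = 10


def confidence_score(basis: str, problems: Iterable[str] = ()) -> int:
    """Base score minus each distinct penalty, never lower than MIN_SCORE."""
    deduction = sum(PENALTIES[p] for p in set(problems))
    return max(MIN_SCORE, BASE_SCORES[basis] - deduction)
```

For example, a name-search substitution (65) whose original identity number was missing from the identity table (−10) scores 55. With the published values, the lowest base score with all three deductions is 40 − 30 = 10, which coincides with the floor.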


Interpreting the Numbers

A single relationship score tells you how confident we are about that one link. When you are looking at a person's parents, children, or spouse, the score next to each relationship tells you the strength of evidence for that specific connection.

Confidence bands

Score range | Interpretation | Guidance
85 – 100 | High confidence: the relationship is well-evidenced. Names and identity numbers agree, possibly confirmed by the name index. | Suitable for citation.
70 – 84 | Good confidence: the relationship is probable. Minor name discrepancies were resolved by cross-reference. | Worth verifying against primary sources if precision matters.
50 – 69 | Fair confidence: the relationship rests on inference, such as a name search substitution, an uncertain child record, or a placeholder identity. | Treat with caution and seek corroboration.
10 – 49 | Low confidence: a contradiction exists that could not be resolved, or a sex mismatch was detected. The relationship may be incorrect. | Do not cite without independent verification.
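
The band boundaries translate directly into a lookup. The function name `confidence_band` and the short band labels are assumptions for this sketch; the thresholds are the ones in the table above.

```python
def confidence_band(score: float) -> str:
    """Return the confidence band for a score between 10 and 100."""
    if not 10 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 85:
        return "High"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Low"
```

A score of 85 falls in the high band and 84 in the good band, so a single 5-point deduction can move a relationship from one band to the next.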

Averages across relationship chains

When the database displays extended family connections — grandparents, cousins, or collateral lines — it chains together several individual relationships. The confidence score shown for a chained connection is the arithmetic average of all the individual scores along the chain.

Example: Suppose a connection runs through three relationships with scores of 95, 65, and 80. The confidence for the full chain is displayed as (95 + 65 + 80) ÷ 3 = 80. The weak middle link — perhaps a name-search substitution — brings the overall confidence down even though the other two links are strong.

A low average score for a distant relative does not necessarily mean the whole family line is wrong — it may mean that one link in the chain is uncertain while the rest are solid. You can inspect each individual relationship to see where the uncertainty lies.
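
The chain calculation, together with a way to locate the weakest link, can be sketched as follows. The function names are hypothetical, and the sketch returns the unrounded mean because this page does not say whether displayed values are rounded.

```python
from typing import Sequence


def chain_confidence(scores: Sequence[float]) -> float:
    """Arithmetic mean of the individual relationship scores along a chain."""
    if not scores:
        raise ValueError("a chain needs at least one relationship")
    return sum(scores) / len(scores)


def weakest_link(scores: Sequence[float]) -> int:
    """Index of the lowest-scoring relationship (the first one if tied)."""
    if not scores:
        raise ValueError("a chain needs at least one relationship")
    return min(range(len(scores)), key=lambda i: scores[i])
```

For the chain in the example, `chain_confidence([95, 65, 80])` gives 80.0 and `weakest_link([95, 65, 80])` gives 1, the position of the middle link.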

Scores will improve over time

As errors in the source data are identified and corrected — wrong identity numbers fixed, missing maiden names added to the name index, duplicate children records removed — the migration process is re-run and confidence scores are updated automatically. A relationship that scores 65 today because a name search was required may score 100 tomorrow once the source record is corrected to use the right identity number.


Confidence scores and relationship explanations are generated automatically by the database migration process and updated whenever the source data changes. They reflect the state of the historical records as currently transcribed, not an independent assessment of the genealogical evidence.