Data Conversion Notes: Difference between revisions
Revision as of 18:02, 26 February 2026
Overview
Extracting the information from Michael Turner's original Access database was an example of what we can call digital archaeology: scraping away layers of revised or redundant data, identifying and clearing experiments that didn't work, rejecting partially completed or eventually ignored tables, and incrementally building a script to pull out the gold coins and silver treasures for presentation in this wiki.
This description of the process of identifying and extracting the information from the database is not meant to disparage the years of effort that Michael Turner and the rest of the team spent on this project. It is a herculean effort that resulted in 30,906 individuals being described in painstaking detail across hundreds of years of book making history in London.
The original database
The MS Access database retrieved from Michael Turner's laptop had 50 tables in it. These were extracted using mdbtools and imported into a MySQL database. There were few indexes and no foreign keys in the original tables, but the number of records is small, so their absence is not a problem. We ran a home-grown database profiler on the new database to get a sense of what data are in each table and column. There are 365 columns across the 50 original tables.
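As an illustration of the kind of profiling described above, here is a minimal sketch, not the profiler we actually used. It assumes an SQLite connection as a stand-in for MySQL so that it is self-contained; the function name `profile_database` and the sample table contents are hypothetical.

```python
import sqlite3

def profile_database(conn):
    """Return one row per column: (table, column, rows, non-null, distinct)."""
    cur = conn.cursor()
    # Materialise the table list first so later queries can reuse the cursor.
    tables = [r[0] for r in cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    report = []
    for table in tables:
        total = cur.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        columns = [r[1] for r in cur.execute(f'PRAGMA table_info("{table}")')]
        for col in columns:
            # COUNT(col) skips NULLs; COUNT(DISTINCT col) counts unique values.
            non_null, distinct = cur.execute(
                f'SELECT COUNT("{col}"), COUNT(DISTINCT "{col}") FROM "{table}"'
            ).fetchone()
            report.append((table, col, total, non_null, distinct))
    return report

# Hypothetical sample data, only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Occupations (LBTNumber INTEGER, Occupation TEXT)")
conn.executemany(
    "INSERT INTO Occupations VALUES (?, ?)",
    [(1, "Bookseller"), (2, "Bookeller"), (3, None), (4, "Bookseller")],
)
report = profile_database(conn)
for row in report:
    print(row)
```

A column whose distinct count is close to its row count is a candidate identifier, while a column that is mostly NULL is a candidate for an abandoned experiment.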
Changes to the original Wiki after initial load
We can see by inspecting the logs of the original wiki created by ULCC that many edits were made after it was generated from the database. Many of these edits were formatting changes with no updates to the information content. We are still reviewing these changes to ensure that no significant body of information is contained in these edits which we would need to extract and update into the current database.
Specific Issues
Occupations
The occupations table originally had 699 different occupations across the 18,736 individuals with an occupation. Some of the multiplicity resulted from typos: Bookeller, Bookeseller, Bookselller, Bokseller, Booskeller, Booksellser, Bookkseller, Bokkseller, Bookiseller, etc. Some others were different ways of describing the same occupation: Paper Maker vs Paper Manufacturer. We decided to correct the obvious typos and otherwise preserve the occupations as originally listed for each member of the database. Additionally, we created a shorter list of 65 occupations into which we could group the detailed occupations; these we used to create wiki Categories.
For example, the Category "Performing Arts" collects together anybody with the original occupation of Actor, Actress, Composer, Dramatist, Gentleman of the Chapel Royal, Musician, Opera manager, Organist, or Playwright. See the complete list of occupations and Categories for details.
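The two-step treatment of occupations (correct obvious typos, then group into Categories) can be sketched as follows. This is an assumed shape, not our actual script: the dictionaries hold only a few entries taken from the examples above, and the names `TYPO_FIXES`, `CATEGORY_OF`, `normalise`, and `category_for` are hypothetical.

```python
# Partial, illustrative correction table: misspelling -> corrected spelling.
TYPO_FIXES = {
    "Bookeller": "Bookseller",
    "Bookeseller": "Bookseller",
    "Bokseller": "Bookseller",
}

# Partial, illustrative grouping: detailed occupation -> wiki Category.
CATEGORY_OF = {
    "Actor": "Performing Arts",
    "Organist": "Performing Arts",
    "Playwright": "Performing Arts",
}

def normalise(occupation):
    """Fix known typos; leave every other occupation exactly as recorded."""
    cleaned = occupation.strip()
    return TYPO_FIXES.get(cleaned, cleaned)

def category_for(occupation):
    """Return the Category for an occupation, or None if it is not in the sketch table."""
    return CATEGORY_OF.get(normalise(occupation))

print(normalise("Bookeller"))
print(normalise("Paper Manufacturer"))
print(category_for("Organist"))
```

Keeping the corrections and the grouping in separate tables means that variant descriptions such as Paper Maker and Paper Manufacturer stay distinct on the person page while still sharing a Category.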
Relationships
The database has separate tables for Parents, Children, Siblings, Marriages (dates only), and Partnerships (i.e., marriages). All these data were abstracted into a Relationship table linking two people and identifying the relationship type. The resulting table has 12,449 records detailing two kinds of relationships: spouse and child.
For example, Thomas Leach (32366) is a child from the marriage of Dryden Leach (19861) and Elizabeth Ayres (4476). This is represented by a relationship record of type "spouse" between Dryden and Elizabeth, and two relationship records of type "child" between Thomas and each of Dryden and Elizabeth. Since we also have relationship records showing other children of Dryden and Elizabeth, we can show those individuals as siblings when creating the page for Thomas Leach. Adding grandparents, cousins, etc. is merely a matter of extending the database queries used when assembling the Family Relationships section of the person page.
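The Leach example can be expressed as a small sketch showing how siblings fall out of the "child" records. The record layout is an assumption (for a "child" record, the first person is the child and the second is the parent), the names `Relationship`, `parents_of`, and `siblings_of` are hypothetical, and the second child with number 99999 is invented purely for illustration.

```python
from collections import namedtuple

Relationship = namedtuple("Relationship", "person_a person_b kind")

# Assumed convention: for kind == "child", person_a is the child
# and person_b is the parent.
RELATIONSHIPS = [
    Relationship(19861, 4476, "spouse"),   # Dryden Leach and Elizabeth Ayres
    Relationship(32366, 19861, "child"),   # Thomas Leach, child of Dryden
    Relationship(32366, 4476, "child"),    # Thomas Leach, child of Elizabeth
    Relationship(99999, 19861, "child"),   # invented sibling, illustration only
    Relationship(99999, 4476, "child"),
]

def parents_of(person, rels):
    """Everyone recorded as a parent of the given person."""
    return {r.person_b for r in rels if r.kind == "child" and r.person_a == person}

def siblings_of(person, rels):
    """Everyone who shares at least one recorded parent with the given person."""
    parents = parents_of(person, rels)
    return {
        r.person_a
        for r in rels
        if r.kind == "child" and r.person_b in parents and r.person_a != person
    }

print(parents_of(32366, RELATIONSHIPS))
print(siblings_of(32366, RELATIONSHIPS))
```

Because the test is "shares at least one parent", this sketch would also report half-siblings; requiring both parents to match would be a stricter variant.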
Note that it was necessary to add individuals to the list of persons in the database. Just as wives of printers have their own LBT number so they show up in the `Partnerships` table even when they have no book trade occupation, so too we need LBT numbers for children so they can be referenced in the new `Relationship` table. There are partnership numbers in the original `Children` table that do not match any records in the `Partnerships` table, so the children in those cases will not have any family information on their person page. There are 114 children and 520 marriages that do not refer to any records in the `Partnerships` table, and so there will be holes for those individuals.
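Finding the records that point at a missing partnership is an anti-join. A minimal sketch follows, assuming each row is a dictionary with a `partnership_id` key; that key, the function name `unmatched_rows`, and the sample rows (including partnership 77777) are hypothetical and do not reflect the original column names.

```python
def unmatched_rows(rows, partnerships):
    """Return the rows whose partnership_id has no record in partnerships."""
    known = {p["partnership_id"] for p in partnerships}
    return [row for row in rows if row["partnership_id"] not in known]

# Illustrative data only.
partnerships = [{"partnership_id": 6812}, {"partnership_id": 5027}]
children = [
    {"name": "Child with known parents", "partnership_id": 5027},
    {"name": "Child with missing partnership", "partnership_id": 77777},
]

orphans = unmatched_rows(children, partnerships)
print(orphans)
```

The same function applies unchanged to the marriage records, since they reference partnerships in the same way.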
Identifiers
Every person in the database is represented by at least a record in the `Identity` table indexed by the `LBTNumber`. This table also has a column called `IDNumber`, which we assume is a previous numbering system used before LBT Numbers were created. This only presents a problem when a cross reference in a comment or text field somewhere uses the old number rather than the new one.
Some additional code was needed to recognise both incomplete references to the LBT number (since wikis only understand links to page titles) and references to old numbers, and change both to the correct page title for the intended person.
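The rewriting step might look like the sketch below. It is not the code we ran: the reference shapes it recognises ("LBT 16225", "LBT16225", "ID 4321") are assumptions about how numbers appear in free text, the old number 4321 is invented, and the lookup tables `TITLE_BY_LBT` and `LBT_BY_OLD_ID` are hypothetical stand-ins for data drawn from the `Identity` table.

```python
import re

# Hypothetical lookups built from the Identity table.
TITLE_BY_LBT = {16225: "BATTERSBY, Katherine ‹LBT16225›"}
LBT_BY_OLD_ID = {4321: 16225}  # old IDNumber -> LBTNumber (4321 is invented)

# Assumed reference shapes; the lookbehind skips numbers that already
# sit inside a full page title such as "‹LBT16225›".
REFERENCE = re.compile(r"(?<!‹)\b(LBT|ID)\s?(\d+)\b")

def link_references(text):
    """Replace bare LBT or old ID references with a wiki link to the page title."""
    def replace(match):
        kind, number = match.group(1), int(match.group(2))
        lbt = number if kind == "LBT" else LBT_BY_OLD_ID.get(number)
        title = TITLE_BY_LBT.get(lbt)
        # Leave the text untouched when the number cannot be resolved.
        return f"[[{title}]]" if title else match.group(0)
    return REFERENCE.sub(replace, text)

print(link_references("See LBT 16225."))
print(link_references("Sister of ID 4321"))
```

Unresolvable numbers are deliberately left as they were, so that they remain visible for manual review instead of becoming broken links.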
Calendar
The calendar table has 101,940 records: one for each day of the years 1557 to 1830 plus some extra non-contiguous days going back as far as 1357 and forward to 1915.
All days are marked with the current regnal year, and Stationers' Company ("St.Co.") "Court", "Pension Court", and "Engl.Stock Dividend." days are marked.
We've decided that this can best be combined with the `Events` table so that we can show the events across the entire population in calendar sequence. The new Calendar page lists the centuries that are populated, and a user may browse down to the year and day of interest. From the day page from the calendar, the user can link to any of the people referenced on the page.
Scraping the old wiki's -ASS pages adds nearly 64,000 event records to the `events` table. This may add too much information to the person pages, but we've made the Events table collapsible in the wiki interface, so it is easy to hide if it is in the way.
Unresolved Issues Still Being Worked On
Children
Some people appear to have more children than expected, and some of the children seem duplicated. This may be a consequence of assumptions made while extracting the data into the relationship table from overlapping and redundant tables parents, children, siblings, relationships, and marriages.
Let's illustrate with an example. Take BATTERSBY, Katherine ‹LBT16225›. Starting with the Parents table, we see her father is 13464 and her mother is 3732. The Partnerships record with those two members is number 6812. The Children table has eight records pointing to that Partnership; those eight include a Katherine but she doesn't get an original lbtnumber, so her identity is ambiguous. The Parents table has 11409 and 3549 as Katherine's parents, and they are in partnership 5027. On the other hand, starting with the Children table, we see relationship 5027 as the parents of 16225, and working that way, we can see that table has two children of that partnership, Winstanley and a Katherine, who in this case has the lbtnumber of 16225.
In other words, starting with the Parents table, we find one pair of parents who have eight children, but starting with the Children table, we find a different pair of parents who have only two children. This contradiction needs a solution in the data.
My first attempt will be to believe the Parents table if it is contradicted by the Children table.
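That rule can be sketched with the numbers from the Katherine Battersby example. This is only an illustration of the intended precedence, not finished code: the dictionary shapes and the function name `resolve_parents` are assumptions, and the sketch takes 13464 and 3732 as the Parents-table reading for 16225.

```python
# Partnership number -> (father, mother), using the numbers quoted above.
PARTNERSHIPS = {6812: (13464, 3732), 5027: (11409, 3549)}

def resolve_parents(person, parents_table, children_table):
    """Return (parents, conflict), preferring the Parents table.

    parents_table maps a person to (father, mother);
    children_table maps a person to a partnership number.
    """
    from_parents = parents_table.get(person)
    from_children = PARTNERSHIPS.get(children_table.get(person))
    if from_parents is None:
        # Nothing to contradict, so fall back to the Children table.
        return from_children, False
    conflict = from_children is not None and from_children != from_parents
    return from_parents, conflict

parents_table = {16225: (13464, 3732)}
children_table = {16225: 5027}

print(resolve_parents(16225, parents_table, children_table))
```

Returning the conflict flag alongside the chosen parents keeps a record of every case where the two tables disagreed, which is useful if the rule later needs revisiting.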
English Stock
We are still researching how to best use the partial information on ownership of English Stock.
Livery Company
Because of its importance, we have listed membership in the Stationers' Company in release 2.0; other livery companies will be listed in a later release.
Life Events
The events in the Births, Baptisms, Deaths, and Burials tables need to be checked against the `Events` table to ensure that all available data have been used.
Currently, the data from these four tables and their four corresponding `_sources` tables are compiled into the "Life Events" table for each person page. The data are also analysed to infer the best canonical birth and death dates for use in the page title after the name. Also, we are experimenting with a way to identify flourishing dates.
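One possible inference rule, shown here only as a sketch, is to prefer a recorded birth over a baptism and a recorded death over a burial, taking the earliest date when several are recorded. That rule is an assumption made for illustration and is not necessarily the analysis we apply; the function names and sample dates are hypothetical, and real records often carry only a year, which this sketch does not handle.

```python
from datetime import date

def pick(primary, fallback, primary_label, fallback_label):
    """Return (date, source label), preferring the primary list of dates."""
    if primary:
        return min(primary), primary_label
    if fallback:
        return min(fallback), fallback_label
    return None, None

def canonical_dates(births, baptisms, deaths, burials):
    """Return ((birth date, source), (death date, source)) for one person."""
    return (
        pick(births, baptisms, "birth", "baptism"),
        pick(deaths, burials, "death", "burial"),
    )

# Invented dates: no birth record, so the baptism stands in for it.
result = canonical_dates(
    births=[],
    baptisms=[date(1701, 3, 2)],
    deaths=[date(1760, 11, 5)],
    burials=[date(1760, 11, 9)],
)
print(result)
```

Carrying the source label with each date lets the page title distinguish a recorded birth from one inferred from a baptism.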