Data Conversion Notes

From London Book Trades
Revision as of 10:43, 23 February 2026

Overview

Extracting the information from Michael Turner's original Access database was an example of what we can call digital archaeology: scraping away layers of revised or redundant data, identifying and clearing experiments that didn't work, rejecting partially completed or eventually ignored tables, and incrementally building a script to pull out the gold coins and silver treasures for presentation in this wiki.

This description of the process of identifying and extracting the information from the database is not meant to disparage the years of effort that Michael Turner and the rest of the team spent on this project. It was a herculean effort that resulted in 30,906 individuals being described in painstaking detail across hundreds of years of bookmaking history in London.

The original database

The MS Access database retrieved from Michael Turner's laptop had 50 tables in it. These were extracted using mdbtools and imported into a MySQL database. There were few indexes and no foreign keys in the original tables, but the number of records is small so doing without is not a problem. We ran a home-grown database profiler on the new database to get a sense of what data are in each table and column. There are 365 columns across the 50 original tables.
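The profiler itself is not described here, so the following is only a minimal sketch of the kind of per-column summary such a tool might produce. It assumes an in-memory SQLite database as a stand-in for MySQL, and the `Surname` column and sample values are hypothetical.

```python
import sqlite3

def profile_database(conn):
    """Return {(table, column): stats} with row, non-null and distinct counts."""
    cur = conn.cursor()
    tables = [row[0] for row in cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    profile = {}
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        columns = [row[1] for row in cur.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            rows, non_null, distinct = cur.execute(
                f'SELECT COUNT(*), COUNT("{column}"), COUNT(DISTINCT "{column}") '
                f'FROM "{table}"').fetchone()
            profile[(table, column)] = {
                "rows": rows, "non_null": non_null, "distinct": distinct}
    return profile

# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Identity (LBTNumber INTEGER, IDNumber INTEGER, Surname TEXT)")
conn.executemany(
    "INSERT INTO Identity VALUES (?, ?, ?)",
    [(19861, 1, "Leach"), (32366, None, "Leach"), (4476, 2, "Ayres")])
profile = profile_database(conn)
```

A summary like this makes it easy to spot columns that are empty, nearly constant, or unexpectedly varied before deciding which tables to carry forward.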

Changes to the original Wiki after initial load

We can see by inspecting the logs of the original wiki created by ULCC that many edits were made after it was generated from the database. Many of these edits were formatting changes with no updates to the information content. We are still reviewing these changes to ensure that no significant body of information is contained in these edits which we would need to extract and update into the current database.

Specific Issues

Occupations

The occupations table originally had 699 different occupations across the 18,736 individuals with an occupation. Some of the multiplicity resulted from typos: Bookeller, Bookeseller, Bookselller, Bokseller, Booskeller, Booksellser, Bookkseller, Bokkseller, Bookiseller, etc. Some others were different ways of describing the same occupation: Paper Maker vs Paper Manufacturer. We decided to correct the obvious typos and otherwise preserve the occupations as originally listed for each member of the database. Additionally, we created a shorter list of 65 occupations into which we could group the detailed occupations; these we used to create wiki Categories.

For example, the Category "Performing Arts" collects together anybody with the original occupation of Actor, Actress, Composer, Dramatist, Gentleman of the Chapel Royal, Musician, Opera manager, Organist, or Playwright. See the complete list of occupations and Categories for details.
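The two-step approach described above (correct obvious typos, then group detailed occupations into Categories) can be sketched roughly as follows. The dictionaries are tiny illustrative excerpts, the function names are hypothetical, and only the "Performing Arts" grouping is taken from the text; this is not the conversion script itself.

```python
# Illustrative excerpt of a typo-correction map (the real list is much longer).
TYPO_FIXES = {
    "Bookeller": "Bookseller",
    "Bokseller": "Bookseller",
    "Booskeller": "Bookseller",
}

# Illustrative excerpt of the occupation-to-Category map.
CATEGORY_OF = {
    "Actor": "Performing Arts",
    "Organist": "Performing Arts",
    "Playwright": "Performing Arts",
}

def normalise_occupation(raw):
    """Fix known typos; otherwise keep the occupation as originally listed."""
    cleaned = raw.strip()
    return TYPO_FIXES.get(cleaned, cleaned)

def category_for(raw):
    """Return the wiki Category for an occupation, or None if it is not mapped here."""
    return CATEGORY_OF.get(normalise_occupation(raw))
```

Keeping the typo map separate from the Category map means that "Paper Maker" and "Paper Manufacturer" stay distinct on the person page while still being able to share a Category.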

Relationships

The database has separate tables for Parents, Children, Siblings, Marriages (dates only), and Partnerships (i.e., marriages). All these data were abstracted into a Relationship table linking two people and identifying the relationship type. The resulting table has 12,449 records detailing two kinds of relationships: spouse and child.

For example, Thomas Leach (32366) is a child from the marriage of Dryden Leach (19861) and Elizabeth Ayres (4476). This is represented by a relationship record of type "spouse" between Dryden and Elizabeth, and two relationship records of type "child" between Thomas and each of Dryden and Elizabeth. Since we also have relationship records showing other children of Dryden and Elizabeth, we can show those individuals as siblings when creating the page for Thomas Leach. Adding grandparents, cousins, etc., is merely an addition to the database queries used when assembling the Family Relationships section of the person page.
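How siblings fall out of "spouse" and "child" records alone can be illustrated with a small sketch. The tuple layout (child, parent, type) is an assumption about the `Relationship` table, and the person numbered 90001 is a hypothetical second child added only so the example has a sibling to find.

```python
# (person_a, person_b, relationship_type); for "child", person_a is the child.
relationships = [
    (19861, 4476, "spouse"),   # Dryden Leach and Elizabeth Ayres
    (32366, 19861, "child"),   # Thomas Leach, child of Dryden
    (32366, 4476, "child"),    # Thomas Leach, child of Elizabeth
    (90001, 19861, "child"),   # hypothetical sibling, for illustration only
    (90001, 4476, "child"),
]

def parents_of(person):
    """People to whom `person` has a 'child' relationship record."""
    return {b for a, b, kind in relationships if kind == "child" and a == person}

def siblings_of(person):
    """Other people who share at least one parent with `person`."""
    parents = parents_of(person)
    return {a for a, b, kind in relationships
            if kind == "child" and b in parents and a != person}
```

Grandparents would follow the same pattern: apply `parents_of` to each member of `parents_of(person)`.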

Note that it was necessary to add individuals to the list of persons in the database. Just as wives of printers have their own LBT number so they show up in the `Partnerships` table even when they have no book trade occupation, so too we need LBT numbers for children so they can be referenced in the new `Relationship` table. There are partnership numbers in the original `Children` table that do not match any records in the `Partnerships` table, so the children in those cases will not have any family information on their person page. There are 114 children and 520 marriages that do not refer to any records in the `Partnerships` table, and so there will be holes for those individuals.
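Because the original tables have no foreign keys, dangling references like these have to be found by comparison. A minimal sketch of such a check follows; the `PartnershipNumber` column name and the sample rows are hypothetical.

```python
def unmatched_references(rows, key, valid_ids):
    """Return the rows whose `key` value does not appear in `valid_ids`."""
    return [row for row in rows if row[key] not in valid_ids]

# Hypothetical sample data, for illustration only.
partnership_ids = {101, 102}
children = [
    {"LBTNumber": 1, "PartnershipNumber": 101},
    {"LBTNumber": 2, "PartnershipNumber": 999},  # no matching partnership
]
orphans = unmatched_references(children, "PartnershipNumber", partnership_ids)
```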

Identifiers

Every person in the database is represented by at least a record in the `Identity` table indexed by the `LBTNumber`. This table also has a column called `IDNumber`, which we assume is a previous numbering system used before LBT Numbers were created. This only presents a problem when a cross reference in a comment or text field uses the old number rather than the new one.

Some additional code was needed to recognise both incomplete references to the LBT number (since wikis only understand links to page titles) and references to old numbers, and change both to the correct page title for the intended person.
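The shape of that rewriting step might look something like the sketch below. The reference patterns ("LBT 19861", "ID 1234"), the old-number mapping, and the page titles are all hypothetical, since the actual formats found in the text fields are not documented here; only the `[[...]]` link syntax is standard MediaWiki.

```python
import re

# Hypothetical lookup tables; real page titles also carry dates after the name.
PAGE_TITLE_BY_LBT = {19861: "Dryden Leach", 32366: "Thomas Leach"}
LBT_BY_OLD_ID = {1234: 19861}

# Assumed reference formats: "LBT <number>" (new) and "ID <number>" (old).
REFERENCE = re.compile(r"\b(LBT|ID)\s*(\d+)\b")

def link_references(text):
    """Replace recognised number references with wiki links to page titles."""
    def replace(match):
        kind, number = match.group(1), int(match.group(2))
        lbt = number if kind == "LBT" else LBT_BY_OLD_ID.get(number)
        title = PAGE_TITLE_BY_LBT.get(lbt)
        # Leave the reference untouched when no page can be identified.
        return f"[[{title}]]" if title else match.group(0)
    return REFERENCE.sub(replace, text)
```

Leaving unresolved references as they are, rather than guessing, keeps them visible for later manual review.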

Calendar

The calendar table has 101,940 records: one for each day of the years 1557 to 1830 plus some extra non-contiguous days going back as far as 1357 and forward to 1915.
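As a rough plausibility check on that record count, the contiguous span can be counted with Python's `datetime`, which uses the proleptic Gregorian calendar. That is an assumption on our part: England used the Julian calendar until 1752, so the table's own day count for the span may differ slightly.

```python
from datetime import date

TOTAL_RECORDS = 101_940  # from the calendar table

# Days from 1 January 1557 up to and including 31 December 1830.
contiguous_days = (date(1831, 1, 1) - date(1557, 1, 1)).days

# Whatever remains would be the extra non-contiguous days.
extra_days = TOTAL_RECORDS - contiguous_days
```

Under this assumption the contiguous span accounts for 100,076 days, leaving roughly 1,864 records for the scattered earlier and later dates.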

All days are marked with the current regnal year, and Stationers' Company ("St.Co.") "Court", "Pension Court", and "Engl.Stock Dividend." days are flagged.

We've decided that this can best be combined with the `Events` table so that we can show the events across the entire population in calendar sequence. The new Calendar page lists the centuries that are populated, and a user may browse down to the year and day of interest. From a day page in the calendar, the user can link to any of the people referenced on that page.
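The century, year, and day drill-down can be pictured as a nested index built from event records. This is only a sketch of the idea: the (date, LBT number) tuple layout is an assumption, and the sample events and person numbers are invented for illustration.

```python
from collections import defaultdict
from datetime import date

def build_calendar_index(events):
    """Group (day, lbt_number) pairs as century -> year -> day -> {lbt_numbers}."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for day, lbt_number in events:
        century = day.year // 100 * 100  # e.g. 1557 -> 1500
        index[century][day.year][day].add(lbt_number)
    return index

# Hypothetical sample events, for illustration only.
events = [
    (date(1557, 5, 4), 1),
    (date(1760, 3, 1), 2),
    (date(1760, 3, 1), 3),
]
index = build_calendar_index(events)
```

Only populated centuries, years, and days appear as keys, which matches a Calendar page that lists just the centuries that have data.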

Scraping the old wiki's -ASS pages adds nearly 64,000 event records to the `events` table. This may add too much information to the person pages, but we've made the Events table collapsible in the wiki interface, so it is easy to hide if it is in the way.

Unresolved Issues Still Being Worked On

English Stock

We are still researching how to best use the partial information on ownership of English Stock.

Livery Company

Because of its importance, we have listed membership in the Stationers' Company in 2.0; other livery companies will be listed in a later release.

Life Events

The events in the Births, Baptisms, Deaths, and Burials tables need to be cross-checked against the `Events` table to ensure that all available data have been used.

Currently, the data from these four tables and their four corresponding `_sources` tables are compiled into the "Life Events" table for each person page. The data are also analysed to infer the best canonical birth and death dates for use in the page title after the name. Also, we are experimenting with a way to identify flourishing dates.
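The inference rules are not spelled out here, so the sketch below shows just one plausible approach, under loudly stated assumptions: prefer a recorded birth or death year, fall back to the baptism or burial year marked as approximate, and use "?" when nothing is known. The function names and the suffix format are hypothetical.

```python
def best_year(primary, fallback):
    """Return (year, is_exact): the primary year if known, else the fallback."""
    if primary is not None:
        return primary, True
    if fallback is not None:
        return fallback, False
    return None, False

def format_year(year, is_exact):
    """Render a year for a page title, marking fallbacks as approximate."""
    if year is None:
        return "?"
    return str(year) if is_exact else f"c.{year}"

def title_suffix(birth=None, baptism=None, death=None, burial=None):
    """Build a '(born-died)' suffix from whichever life-event years exist."""
    born = format_year(*best_year(birth, baptism))
    died = format_year(*best_year(death, burial))
    return f"({born}-{died})"
```

For instance, a person with only a baptism in 1700 and a death in 1760 would get the suffix `(c.1700-1760)` under these rules.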