Short recap of what we discussed in Leipzig
- We will work together on modeling and linking the data we selected for the Huygens/CLARIAH-Geovistory pilot, to be presented on 20 June 2019 in Amsterdam
- The aim is not to do everything perfectly, but to show 1) the potential of Geovistory, and 2) the value of data modeling done properly
- We will join efforts with Pierre Vernus, as his data is broadly related to the Huygens data
Some (additional) explanation of the data
We at the Huygens Institute really like the way in which Geovistory works with a shared repository of entities, from which various projects can ‘pick and choose’ and thus create their own scope of the data. Within the three structured datasets we selected, there are persons (or ‘Actors’ in CIDOC-CRM-speak), ships and commodities (or ‘Things’), events and places. It would be great if we could model these basic entities (perhaps re-using models developed by George Bruseker; see below, under Action and discussion points) and go some way in interconnecting them. A second major Geovistory feature we’d like to showcase is how it allows users to extract structured data from texts, and thus at the same time add structure to these texts. This is where the fourth collection (the letters compiled in and sent from the Dutch East Indies) comes into play.
Structured data, basics — grouped per entity
Let’s start with the ships:
- We have a complete list of ships of the Dutch East India Company (1855 records), with information on the yard that built each ship and on its tonnage. Dataset: Dutch-Asiatic Shipping (DAS). Ships have unique ids.
Persons:
Important note to start with: the data on persons in these sources are for the most part not yet disambiguated. We have observations of persons who were on board a certain ship during a certain journey. We have multiple observations of the same persons, but most observations have not yet been resolved to individuals.
- We have data on the persons on board during some of the journeys between Europe and Asia
- Names of the masters on all voyages from Europe to Asia and back. Dataset: DAS.
- Names, places of birth and details of employment of crew on a selection of outward and homebound voyages (c. 800,000 records). Links with ships established through voyage-id. Masters are additionally linked to the data on masters from the previous bullet through master-id. Dataset: VOC-opvarenden (VOCOPV).
Cargo:
- We have data on the type and value of the cargo of a selection of 18th-century intercontinental and intra-Asiatic voyages. Not yet aligned with a commodities taxonomy. Links with voyages (and ships) established through voyage-id. Dataset: Bookkeeper-General (BGB).
Events:
- We have data on the journeys the ships made between Europe and Asia (dates and places of departure / arrival available). Links with ships established through ‘shipid’. The voyages have their own unique ids. Dataset: DAS.
- For the 18th c, we have additional data on some of these intercontinental journeys. We also have data on journeys within Asia (dates and places of departure / arrival available). Links with previous bullet established through voyage-id. Dataset: BGB.
- We have work records for crew members on voyages from Europe to Asia (start and end date of employment, rank). Dataset: VOCOPV.
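To make the linking described above concrete, here is a minimal sketch of how a crew observation can be followed to its voyage and from there to its ship. Only the link keys (a ship id and a voyage id) come from the datasets. All other column names and every sample row are invented for illustration, and the actual CSV headers may well differ.

```python
import csv
import io

# Invented miniature extracts. Real file layouts and column names may differ.
SHIPS_CSV = """shipid,ship_name,tonnage
1,Example Ship A,1100
2,Example Ship B,600
"""
VOYAGES_CSV = """voyage_id,shipid,departure_place
100,1,Example Port
101,2,Example Port
"""
CREW_CSV = """observation_id,voyage_id,name,rank
9001,100,Person One,sailor
9002,100,Person Two,soldier
9003,101,Person One,sailor
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def index_by(rows, key):
    """Build a lookup table from the value in column `key` to its row."""
    return {row[key]: row for row in rows}


def crew_with_ship(crew_rows, voyages_by_id, ships_by_id):
    """Follow observation -> voyage -> ship; missing links yield None."""
    linked = []
    for obs in crew_rows:
        voyage = voyages_by_id.get(obs["voyage_id"])
        ship = ships_by_id.get(voyage["shipid"]) if voyage else None
        linked.append({
            "observation_id": obs["observation_id"],
            "name": obs["name"],
            "voyage_id": obs["voyage_id"],
            "ship_name": ship["ship_name"] if ship else None,
        })
    return linked


ships = index_by(read_rows(SHIPS_CSV), "shipid")
voyages = index_by(read_rows(VOYAGES_CSV), "voyage_id")
linked = crew_with_ship(read_rows(CREW_CSV), voyages, ships)
for row in linked:
    print(row)
```

Note that the two observations named "Person One" stay separate records here: following ids links observations to voyages and ships, but it does not resolve observations to individuals.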
Locations:
- We have standardized and geo-referenced locations of yards where ships were built (reconciled to either GeoNames or Wikidata). Dataset: DAS.
- We have standardized and geo-referenced locations of places of departure and arrival of voyages from Europe to Asia (reconciled to either GeoNames or Wikidata). Dataset: DAS.
- We have locations of places of departure and arrival of intra-Asiatic voyages; as yet neither standardized nor geo-referenced. Dataset: BGB.
- We have locations of places of birth of crew members (c. 800,000 observations; 150,000+ unique toponym attestations; c. 35,000 unique attestations / 640,000 observations have been standardized, geo-referenced and reconciled to either GeoNames or Wikidata). Dataset: VOCOPV.
Unstructured data
There are many references to the entities from the structured data in the fourth data collection, the Official letters of the VOC. We have not yet established links between the mentions in the letters and the structured data; the NER output in the hOCR files was very preliminary. It would be great if we could show how Geovistory allows users to explore (already established) links between textual sources and records of entities and to establish new links.
Action and discussion points
- Action: Contact George Bruseker for information on the data modeling done within the SeaLiT project (http://www.sealitproject.eu) and a Swiss data aggregation project, where a number of core entities of historical datasets were defined (George developed the data model under the conceptual umbrella of CIDOC CRM). Assigned to Lodewijk; e-mail sent on 11 April 2019; awaiting response.
- Discussion: data format of structured data
- I pointed you to the Druid triple store, where the structured data sources are available. It’s important to note that the data were converted into RDF within a very short time span and with a specific goal in mind (the data story on https://stories.datalegend.net/netwerk-maritieme-bronnen/). The structure is far from perfect and some values are missing.
- My suggestion would be to start the ingest into the Geovistory backend from the original files: Excel / csv for DAS and VOCOPV, MySQL for BGB. Given the scope of this pilot, I would suggest focusing on the basic properties of the main entities described above (the datasets contain additional data, which would of course be nice to have, but only if time permits).
- What do you think about this?
- Discussion: data format of text
- I provided the Official letters as hOCR files. The boxes around the individual words give their location on the scans of the book volumes in which the letters were published; these scans are served from our IIIF server (cf. images available through this url: https://beta.resources.huygens.knaw.nl/resourcesorb/?categories[0][0]=Generale%20missiven).
- Is this workable for you, or would you prefer another format (plain txt, tiff images — you could perhaps load these directly from our IIIF-server)?
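For reference, a minimal sketch of how the word boxes can be read from an hOCR file with the Python standard library. It relies on the hOCR convention that each word is a span with class `ocrx_word` whose `title` attribute carries `bbox x0 y0 x1 y1`. The sample markup is invented, and the helper `iiif_region` assumes that the hOCR pixel coordinates match the full-size image on the IIIF server, which I have not verified for our files.

```python
import re
from html.parser import HTMLParser

# hOCR stores the word box in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # box of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Sketch only: assumes no further spans are nested inside a word span.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def iiif_region(bbox):
    """Turn (x0, y0, x1, y1) into a IIIF Image API region 'x,y,w,h'."""
    x0, y0, x1, y1 = bbox
    return f"{x0},{y0},{x1 - x0},{y1 - y0}"


# Invented sample line in hOCR style.
SAMPLE = """<span class='ocr_line' title='bbox 30 90 400 120'>
<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Batavia</span>
<span class='ocrx_word' title='bbox 104 92 150 116; x_wconf 90'>den</span>
</span>"""

parser = HocrWordParser()
parser.feed(SAMPLE)
words = parser.words
for text, bbox in words:
    print(text, bbox, iiif_region(bbox))
```

The region string can be placed in the region segment of a IIIF Image API request to fetch just the image snippet for a word, provided the coordinate assumption above holds.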
- Discussion: linkage
- As said before, the NER output of the Official letters is very rudimentary. However, since we have indices to these books, we could provide better NER (e.g. for a number of pages or, depending on the availability of specialists, one volume). We could also (manually) link a number of person and ship names from the collections of entities with the Official letters. This would allow us to showcase the integration of text and structured entities in Geovistory.
- We have done quite a bit of work on resolving the person observations in VOCOPV to individuals, resulting in a set of c. 50,000 individuals who sailed to the East Indies more than once. We cannot yet publish this whole dataset (as it is part of an ongoing research project), but we could use a selection. This could then be used to showcase how Geovistory builds entities from observations/statements. We could also show how Geovistory allows users to find complementary observations.
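To illustrate what building entities from observations could look like on the data side, here is a minimal sketch. It assumes that each resolved observation carries an identifier for the individual it was resolved to. The field name `individual_id`, the other field names and all sample rows are hypothetical, and the sketch does not describe how the resolution itself was done or how Geovistory stores entities.

```python
from collections import defaultdict

# Invented sample: three observations, two of which were resolved
# to the same individual (hypothetical field names throughout).
observations = [
    {"observation_id": "o1", "individual_id": "p1", "voyage_id": "v10", "rank": "sailor"},
    {"observation_id": "o2", "individual_id": "p1", "voyage_id": "v25", "rank": "boatswain"},
    {"observation_id": "o3", "individual_id": "p2", "voyage_id": "v10", "rank": "soldier"},
]


def build_individuals(rows):
    """Group observations under the individual they were resolved to."""
    individuals = defaultdict(list)
    for obs in rows:
        individuals[obs["individual_id"]].append(obs)
    return dict(individuals)


def repeat_sailors(individuals):
    """Keep only individuals observed on more than one distinct voyage."""
    return {
        pid: obs_list
        for pid, obs_list in individuals.items()
        if len({obs["voyage_id"] for obs in obs_list}) > 1
    }


individuals = build_individuals(observations)
repeaters = repeat_sailors(individuals)
for pid, obs_list in repeaters.items():
    print(pid, [(o["voyage_id"], o["rank"]) for o in obs_list])
```

Grouping this way keeps every original observation intact under the individual, so complementary observations (here, a different rank on a later voyage) remain visible.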
As promised, I’ve made available the Huygens datasets in CSV format. You can access the data folder through this link (password sent by e-mail). In the following, I will give a short explanation of the files and their columns. This diagram shows how the files are interrelated:
File explanation:
Collection: vocop
Collection: das
Collection: bgb
File: voc_places.csv