Huygens/CLARIAH Geovistory pilot

12 April, 2019 - 13:09

lpetram

Short recap of what we discussed in Leipzig

We will work together on modeling and linking the data we selected for the Huygens/CLARIAH-Geovistory pilot, to be presented on 20 June 2019 in Amsterdam
The aim is not to do everything perfectly, but to show 1) the potential of Geovistory, and 2) the value of data modeling done properly
We will join efforts with Pierre Vernus, as his data is broadly related to the Huygens data

Some (additional) explanation of the data

We at the Huygens Institute really like the way in which Geovistory works with a shared repository of entities, from which various projects can ‘pick and choose’ and thus create their own scope of the data. Within the three structured datasets we selected, there are persons (or ‘Actors’ in CIDOC-CRM-speak), ships and commodities (or ‘Things’), events and places. It would be great if we could model these basic entities (perhaps re-using models developed by George Bruseker; see below, under Action points) and go some way towards interconnecting them. A second major Geovistory feature we’d like to showcase is how it allows users to extract structured data from texts, and thus at the same time add structure to these texts. This is where the fourth collection (the letters compiled in and sent from the Dutch East Indies) comes into play.

Structured data, basics — grouped per entity

Let’s start with the ships:

We have a complete list of ships of the Dutch East India Company (1855 records), with information on the yard that built each ship and on its tonnage. Dataset: Dutch-Asiatic Shipping (DAS). Ships have unique ids.

Persons:

Important note to start with: the data on persons in these sources are for the most part not yet disambiguated. We have observations of persons who were on board of a certain ship during a certain journey. We have multiple observations of the same persons, but most observations have not yet been resolved to individuals.

We have data on the persons on board during some of the journeys between Europe and Asia:
- Names of the masters on all voyages from Europe to Asia and back. Dataset: DAS.
- Names, places of birth and details of employment of crew on a selection of outward and homebound voyages (c. 800,000 records). Links with ships established through voyage-id. Additional links for masters with data on masters from previous bullet through master-id. Dataset: VOC-opvarenden (VOCOPV).

Cargo:

We have data on type and value of cargo of a selection of 18th-c intercontinental and intra-Asiatic voyages. Not yet aligned with a commodities taxonomy. Links with voyages (and ships) established through voyage-id. Dataset: Bookkeeper-General (BGB).

Events:

We have data on the journeys the ships made between Europe and Asia (dates and places of departure / arrival available). Links with ships established through ‘shipid’. The voyages have their own unique ids. Dataset: DAS.
For the 18th c, we have additional data on some of these intercontinental journeys. We also have data on journeys within Asia (dates and places of departure / arrival available). Links with previous bullet established through voyage-id. Dataset: BGB.
We have work records for crew members on voyages from Europe to Asia (start and end date of employment, rank). Dataset: VOCOPV.

Locations:

We have standardized and geo-referenced locations of yards where ships were built (reconciled to either GeoNames or Wikidata). Dataset: DAS.
We have standardized and geo-referenced locations of places of departure and arrival of voyages from Europe to Asia (reconciled to either GeoNames or Wikidata). Dataset: DAS.
We have locations of places of departure and arrival of intra-Asiatic voyages; as yet neither standardized nor geo-referenced. Dataset: BGB.
We have locations of places of birth of crew members (c. 800,000 observations; 150,000+ unique toponym attestations; c. 35,000 unique attestations / 640,000 observations have been standardized, geo-referenced and reconciled to either GeoNames or Wikidata). Dataset: VOCOPV.

Unstructured data

There are many references to the entities from the structured data in the fourth data collection, the Official letters of the VOC. We have not yet established links between the mentions in the letters and the structured data; the NER output in the hOCR files was very preliminary. It would be great if we could show how Geovistory allows users to explore (already established) links between textual sources and records of entities and to establish new links.

Action and discussion points

Action: Contact George Bruseker for information on data modeling done within SeaLiT project (http://www.sealitproject.eu) and Swiss data aggregation project, where a number of core entities of historical datasets were defined (George developed data model under the conceptual umbrella of CIDOC CRM). (assigned to Lodewijk; e-mail sent on 11 April 2019; awaiting response).
Discussion: data format of structured data
- I pointed you to the Druid triple store, where the structured data sources are available. It’s important to note that the data were converted into RDF within a very short time span and with a specific goal in mind (the data story on https://stories.datalegend.net/netwerk-maritieme-bronnen/). The structure is far from perfect and some values are missing.
- My suggestion would be to start the ingest into the Geovistory backend from the original files: Excel / csv for DAS and VOCOPV, MySQL for BGB. Given the scope of this pilot, I would suggest focusing on the basic properties of the main entities described above (the datasets contain additional data, which would be nice to have of course, but only if time permits).
- What do you think about this?
Discussion: data format of text
- I provided the Official letters as hOCR files. The boxes around the individual words point to their location on the scans of the book volumes in which the letters were published on our IIIF server (cf. images available through this url: https://beta.resources.huygens.knaw.nl/resourcesorb/?categories[0][0]=Generale%20missiven).
- Is this workable for you, or would you prefer another format (plain txt, tiff images — you could perhaps load these directly from our IIIF-server)?
Discussion: linkage
- As said before, NER-output of the Official letters is very rudimentary. However, since we have indices to these books, we could provide better NER (e.g. for a number of pages or, depending on the availability of specialists, one volume). We could also (manually) link a number of persons / ship’s names from the collections of entities with the Official letters. This would allow us to showcase the integration of text and structured entities in Geovistory.
- We have done quite a bit of work on resolving the person observations in VOCOPV to individuals, resulting in a set of c. 50,000 individuals who sailed to the East Indies more than once. We cannot yet publish this whole dataset (as it is part of an ongoing research project), but we could use a selection. This could then be used to showcase how Geovistory builds entities from observations/statements. We could also show how Geovistory allows users to find complementary observations.
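
To make the hOCR question above a bit more concrete: in hOCR, a recognized word is usually a span of class ocrx_word whose title attribute carries "bbox x0 y0 x1 y1" in pixel coordinates of the scan. The following is a minimal sketch of how the words and their boxes could be read, assuming our files follow that convention. The sample markup and the names HocrWordParser, parse_bbox and extract_words are made up for illustration; they are not taken from the Official letters or from Geovistory.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return the four bbox integers from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every span of class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []      # list of (text, (x0, y0, x1, y1))
        self._bbox = None    # bbox of the word span that was just opened

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        # Only the first non-empty text chunk after a word span is kept.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


def extract_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    return parser.words


# Invented sample markup, for illustration only.
SAMPLE = """
<span class='ocr_line' title='bbox 100 200 900 240'>
  <span class='ocrx_word' title='bbox 100 200 260 240; x_wconf 91'>Batavia</span>
  <span class='ocrx_word' title='bbox 280 200 420 240; x_wconf 88'>schip</span>
</span>
"""

words = extract_words(SAMPLE)
for text, bbox in words:
    print(text, bbox)
```

Each (text, bbox) pair could then serve as the anchor for a link between a mention on a scan and an entity record, which is the kind of linkage discussed above.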

9 May, 2019 - 20:12

lpetram

Data explanation

As promised, I’ve made available the Huygens datasets in CSV format. You can access the data folder through this link (password sent by e-mail). In the following, I will give a short explanation of the files and their columns. This diagram shows how the files are interrelated:

File explanation:

1. vocop: c. 800,000 employment records of VOC crew members, normalized, with references to DAS (see 2.) and vocPlaces (see 4.); also included: vocop_careers.csv: 7614 clusters of VOCOP records showing careers of disambiguated crew members (we have more clusters, but this is part of ongoing research)
2. das: data on the voyages of VOC ships from the Netherlands to Asia and back, normalized, with a few references to 1. and 4.
3. bgb: additional data on voyages of VOC ships, also voyages within Asia, normalized, with references to DAS (see 2.).
4. vocPlaces.csv: gazetteer for locations in 1-3, with GeoNames/Wikidata URIs and lat/long values

Collection: vocop

vocop.csv
1. ID: VOCOP id number (assigned by National Archives — the institution that created this data collection)
2. fullName: full name
3. firstName: first name
4. patronymic: patronymic
5. familyNamePrefix: family name prefix
6. familyName: family name
7. placeOrigin: place of birth (original)
8. vocPlaceID: id of place of birth, refers to vocop_place.csv
9. rank: id of rank, refers to vocop_rank.csv
10. dateBeginService: start date of work relation
11. dateEndServiceNoZeros: end date of work relation
12. reasonEndService: reason why employment ended; 'Chamber [placename]' means the sailor was dismissed from service after returning to the Netherlands by the VOC branch mentioned
13. endServiceWhere: location where employment ended; this is often the name of the ship on which the sailor returned to the Netherlands
14. voyageID: DAS voyageID of outward voyage, refers to das_voyage.csv
15. monthLetter: 'Ja' when the employee appointed a beneficiary who could collect 3 months' worth of pay (every year)
16. debtLetter: 'Ja' when the employee received advance money on signing up with the VOC (often used to pay for clothes and other gear needed on board)
17. generalRemark: general remarks
18. boardedAtCape: 'Ja' when the record describes an employment that started at the Cape of Good Hope (this is slightly complex: employments that started at the Cape are often nested within employments that started in NL; a sailor could e.g. start service in Amsterdam, sail to the Cape on ship 1, and then change ships at the Cape to ship 2, which would then be described in a new record that has a 'Ja' in this field)
19. boardedAtCapeVoyageID: if boarded at Cape: DAS voyageID of ship for leg Cape - East Indies, refers to das_voyage.csv
20. DASvoyageReturnID: DAS voyageID of return voyage, refers to das_voyage.csv
21. sourceReference: reference to physical pay ledger in National Archives
22. scanPermalink: URI of scan of original pay ledger record
vocop_place.csv
1. VOC_placeID: id, refers to vocop.csv
2. placeOrigin: toponym attestation as in source
3. vocUniqueStandardizedToponymID: id of standardized toponym, refers to voc_places.csv
vocop_rank.csv
1. vocRankID: id, refers to vocop.csv
2. rank: rank on board
3. wage: minimum standard wage
4. HISCO_CODE: HISCO category for rank
5. HISCO_URI: HISCO URI (currently non-resolving; work in progress)
vocop_careers.csv
1. clusterID: ID of career cluster
2. clusterRow: row number of record in cluster
3. VOCOP_id: VOCOP id of record that forms part of cluster
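
The career clusters are the simplest illustration of building an individual from observations: every row in vocop_careers.csv points to one VOCOP record, and all rows with the same clusterID belong to one person. A minimal sketch of that grouping, assuming the column names listed above. The sample rows are invented and the function names read_rows and build_careers are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; only the column names follow the description above.
VOCOP_CSV = """ID,fullName,dateBeginService,voyageID
101,Jan Jansz,1735-04-12,9001
102,Jan Jansen,1738-10-03,9002
103,Pieter Claesz,1740-01-20,9003
"""

CAREERS_CSV = """clusterID,clusterRow,VOCOP_id
1,2,102
1,1,101
"""


def read_rows(text):
    return list(csv.DictReader(io.StringIO(text)))


def build_careers(vocop_rows, career_rows):
    """Map each clusterID to its VOCOP records, ordered by clusterRow."""
    by_id = {row["ID"]: row for row in vocop_rows}
    ordered = sorted(career_rows, key=lambda r: (r["clusterID"], int(r["clusterRow"])))
    careers = defaultdict(list)
    for link in ordered:
        record = by_id.get(link["VOCOP_id"])
        if record is not None:  # skip links to records missing from the extract
            careers[link["clusterID"]].append(record)
    return dict(careers)


careers = build_careers(read_rows(VOCOP_CSV), read_rows(CAREERS_CSV))
for cluster_id, records in careers.items():
    print(cluster_id, [r["fullName"] for r in records])
```

Records that appear in no cluster (103 in the sample) stay unresolved observations, which matches the state of most of the VOCOP data.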

Collection: das

das_voyage.csv
1. voyId: DAS voyage ID
2. voyNumberDAS: original DAS voyage number
3. shipNameVariantID: id of ship name, refers to das_shipNameVariant.csv
4. voyMasterID: id for master (i.e. captain) of this voyage, refers to file das_master.csv
5. voyMasterRemark: remark on master
6. voyChamberID: id for VOC chamber (i.e. branch of company) that administered this voyage, refers to file das_chamber.csv
7. voyDepartureEDTF: departure date of voyage
8. voyDeparturePlaceID: id for departure place of voyage, refers to das_place.csv
9. voyCapeArrivalEDTF: arrival date at Cape of Good Hope
10. voyCapeArrivalEDTF_Remark: remark on arrival date at Cape of Good Hope
11. voyCapeDepartureEDTF: departure date from Cape of Good Hope
12. voyCapeDepartureEDTF_remark: remark on departure date from Cape of Good Hope
13. voyArrivalDateEDTF: arrival date of voyage
14. voyArrivalDateEDTF_remark: remark on arrival date of voyage
15. voyArrivalPlaceID: id for arrival place of voyage, refers to das_place.csv
16. voyInvoiceValue: value of goods transported
17. voyChamber2ID: [don’t know? => irrelevant]
18. voyParticulars: general remark on voyage
19. voyCorrespondingNumber: [irrelevant]
20. voyRGPDeel: reference to DAS book volume in which voyage was described
21. voymaster_VOCOPVid: id of VOCOP record that describes employment of the master of this voyage
das_ship.csv
1. shipID: id number for ship
2. voyTonnageMin: minimum tonnage of ship
3. voyTonnageMax: maximum tonnage of ship (in case there is more than one tonnage in the source)
4. voyTypeOfShipID: id of ship type, refers to das_shipType.csv
5. voyBuilt: [irrelevant]
6. voyBuiltRemark: remark on acquisition of ownership
7. voyBuiltY: year when ship was built / hired / bought
8. voyYardYardID: id of yard where ship was built, refers to das_yard.csv
das_shipNameVariant.csv
1. shipNameVariantID: id number of ship name variant
2. shipID: id number of ship, refers to das_ship.csv
3. shipNameVariant: name variant of ship
4. shipNameVariantRemark: remark on names of ship
das_shipType.csv
1. shipTypeID: id for ship type
2. voyTypeOfShip: ship type
3. voyTypeOfShipExternalID: external URI for ship type
das_master.csv
1. voyMasterID: id for masters (i.e. captains)
2. voyMasterLastName: last name of master
3. voyMasterFirstName: first name of master
4. voyMasterFamilyNamePrefix: family name prefix of master
das_onboard.csv
1. onbId: id number for ‘onboard’ data
2. onbVoyageId: voyage id, refers to das_voyage.csv
3. onbCategory: crew category of below numbers (‘total’ means whole crew, not divided in categories)
4. onbI: number of crew at departure
5. onbII: number of deaths between Netherlands and Cape
6. onbIII: number of crew that left ship at Cape
7. onbIV: number of crew that boarded at Cape
8. onbV: number of deaths during whole voyage
9. onbVI: number of crew upon arrival in Asia
das_yard.csv
1. yardID: id number for yard
2. yard: yard name
3. yardLocatedIn_standardizedToponym: place where yard was located
4. uniqueStandardizedToponymID: id for place, refers to voc_places.csv
das_chamber.csv
1. chamID: id number for chamber (i.e. VOC branch)
2. chamber: chamber name
3. chamberFullName: chamber full name
4. chamberLocatedIn_UniekeToponiemenVOCPOPV: place where chamber was located
5. uniqueStandardizedToponymID: id for place, refers to voc_places.csv
das_place.csv
1. placeID: id number for places mentioned in das_voyage.csv
2. toponym_original: toponym as mentioned in DAS
3. toponym_standardized: standardized toponym
4. uniqueStandardizedToponymID: id for place, refers to voc_places.csv
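
Note that a voyage does not point to a ship directly: das_voyage.csv refers to a ship name variant, and the name variant refers to the ship. A minimal sketch of that two-step lookup, assuming the column names listed above. The sample rows are invented and the names index_by and ship_for_voyage are hypothetical.

```python
import csv
import io

# Invented sample rows; only the column names follow the description above.
VOYAGES_CSV = """voyId,shipNameVariantID,voyDepartureEDTF
9001,501,1735-04-12
9002,502,1738-10-03
"""

NAME_VARIANTS_CSV = """shipNameVariantID,shipID,shipNameVariant
501,77,Hollandia
502,77,Holandia
"""

SHIPS_CSV = """shipID,voyTonnageMin,voyTonnageMax
77,700,
"""


def index_by(text, key):
    """Read CSV text and index its rows by the given column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def ship_for_voyage(voy_id, voyages, variants, ships):
    """Follow voyage -> name variant -> ship; return None if a link is missing."""
    voyage = voyages.get(voy_id)
    if voyage is None:
        return None
    variant = variants.get(voyage["shipNameVariantID"])
    if variant is None or variant["shipID"] not in ships:
        return None
    return {"shipID": variant["shipID"], "nameOnVoyage": variant["shipNameVariant"]}


voyages = index_by(VOYAGES_CSV, "voyId")
variants = index_by(NAME_VARIANTS_CSV, "shipNameVariantID")
ships = index_by(SHIPS_CSV, "shipID")

print(ship_for_voyage("9001", voyages, variants, ships))
print(ship_for_voyage("9002", voyages, variants, ships))
```

In the sample, both voyages resolve to the same shipID even though the ship's name is spelled differently, which is the point of keeping name variants in a separate file.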

Collection: bgb

bgb_cargo.csv
1. carId: id number for cargo specification (each unit of cargo on board of ship during voyage gets an id number)
2. carVoyageId: id number of voyage on which cargo was transported, refers to bgb_voyage
3. carProductId: id number for product (i.e. type of cargo), refers to bgb_product
4. carSpecificationId: id number for specification of cargo, refers to bgb_specification
5. carUnit: id number for unit of account, refers to bgb_unit
6. carQuantity: quantity (in units)
7. carQuantityNumeric: quantity (metric value)
8. carValue: total value of cargo
9. carValueGuldens: value guilders
10. carValueStuivers: value stuivers
11. carValuePenningen: value penningen
12. carValueLicht: total value of cargo in Indian money
13. carValueLichtGuldens: value Indian guilders
14. carValueLichtStuivers: value Indian stuivers
15. carValueLichtPenningen: value Indian penningen
16. carRemarks: remarks on cargo
17. carOrder: order in which cargo should be published (i.e.: 1 = first line)
18. changed_when: provenance
19. changed_by: provenance
20. timestamp: provenance
21. all_fields: ?
bgb_place.csv
1. id: id for place, refers to bgb_voyage.csv
2. naam: toponym
3. added_when: provenance
4. added_by: provenance
5. timestamp: provenance
6. regio: id for region in which place was located, refers to bgb_regio
bgb_product.csv
1. id: id for product (i.e. type of cargo), refers to bgb_cargo
2. naam: product name
3. added_when: provenance
4. added_by: provenance
5. timestamp: provenance
bgb_regio.csv
1. id: id for region, refers to bgb_place
2. naam: region toponym
3. added_when: provenance
4. added_by: provenance
5. timestamp: provenance
bgb_relVoyageShip.csv
1. id: id number for voyage-ship relation
2. voyId: id number of voyage, refers to bgb_voyage.csv
3. shipId: id number for ship, refers to bgb_ship.csv
4. timestamp: provenance
bgb_ship.csv
1. id: id number for ship, refers to bgb_relVoyageShip.csv
2. naam: name of ship
3. added_when: provenance
4. added_by: provenance
5. timestamp: provenance
bgb_source.csv
1. id: id number of source reference, refers to bgb_voyage.csv
2. naam: inventory number of journal
3. added_when: provenance
4. added_by: provenance
5. timestamp: provenance
bgb_specification.csv
1. id: id number of cargo specification, refers to bgb_cargo.csv
2. naam: description of extra specification
3. added_when: provenance
4. added_by: provenance
5. timestamp: provenance
bgb_unit.csv
1. id: id number of unit of measure, refers to bgb_cargo.csv
2. naam: description of unit of measure
3. added_when: provenance
4. added_by: provenance
5. timestamp: provenance
bgb_voyage.csv
1. voyId: id number of voyage, refers to bgb_cargo.csv and bgb_relVoyageShip.csv
2. voyBookingDay: book date (day)
3. voyBookingMonth: book date (month)
4. voyBookingYear: book date (year)
5. voyDeparturePlaceId: id for place of departure, refers to bgb_place.csv
6. voyDepartureDay: departure date (day)
7. voyDepartureMonth: departure date (month)
8. voyDepartureYear: departure date (year)
9. voyArrivalPlaceId: id for arrival place, refers to bgb_place.csv
10. voyArrivalDay: arrival date (day)
11. voyArrivalMonth: arrival date (month)
12. voyArrivalYear: arrival date (year)
13. voyInvoiceValue: value of cargo
14. voyInvoiceValueLicht: value of cargo in Indian guilders
15. voyRemarksForEditor: remarks
16. voyDASNumber: corresponding DAS number of ship, refers to das_voyage.csv
17. created_when: provenance
18. created_by: provenance
19. changed_when: provenance
20. changed_by: provenance
21. timestamp: provenance
22. voySourceId: id of source reference, refers to bgb_source
23. voynumber: id number of voyage [irrelevant, used on website]
24. voyImage: [empty]
25. voyRemarksForEndUser: remarks
26. voyDepartureRegioId: id number for departure region, refers to bgb_regio
27. voyArrivalRegioId: id number for arrival region, refers to bgb_regio
28. voyFolioNummer: folio in source
29. all_fields: [irrelevant]
30. first_ship_name: name of first ship (if voyage concerns fleet)
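
The cargo values in bgb_cargo.csv are split into guilders, stuivers and penningen. For comparing or summing values it helps to convert them to a single number. A minimal sketch, using the standard subdivision of 1 guilder = 20 stuivers and 1 stuiver = 16 penningen. It assumes the three columns hold whole-number components of one amount and that an empty cell means zero; both are assumptions that should be checked against the data. The function names are hypothetical.

```python
from fractions import Fraction

STUIVERS_PER_GUILDER = 20
PENNINGEN_PER_STUIVER = 16


def _as_int(value):
    """Treat None or an empty cell as zero (assumption, see lead-in)."""
    text = "" if value is None else str(value).strip()
    return int(text) if text else 0


def to_penningen(guldens, stuivers, penningen):
    """Express an amount in the smallest unit, penningen."""
    total_stuivers = _as_int(guldens) * STUIVERS_PER_GUILDER + _as_int(stuivers)
    return total_stuivers * PENNINGEN_PER_STUIVER + _as_int(penningen)


def to_guilders(guldens, stuivers, penningen):
    """Express an amount as an exact fraction of guilders."""
    per_guilder = STUIVERS_PER_GUILDER * PENNINGEN_PER_STUIVER
    return Fraction(to_penningen(guldens, stuivers, penningen), per_guilder)


# Example with invented values, as they might appear in
# carValueGuldens, carValueStuivers and carValuePenningen.
amount = to_guilders("12", "10", "8")
print(amount, float(amount))
```

Working in whole penningen (or exact fractions) avoids rounding differences when many cargo lines are added up per voyage.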

File: voc_places.csv

uniqueStandardizedToponymID: id number of standardized toponym
uniqueStandardizedToponymCountryCode: standardized toponym, followed by country code
URI: external URI for place
LAT: lat value for place
LNG: long value for place
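
Getting from a crew member's place of birth to coordinates takes two steps: vocop.csv refers to a toponym attestation in vocop_place.csv, and the attestation refers to a standardized toponym in voc_places.csv. A minimal sketch of that chain, assuming the column names listed above. The sample rows, the example.org URI and the function names are invented for illustration.

```python
import csv
import io

# Invented sample rows; only the column names follow the description above.
VOCOP_CSV = """ID,placeOrigin,vocPlaceID
101,Amsteldam,11
102,Onbekend,
"""

VOCOP_PLACE_CSV = """VOC_placeID,placeOrigin,vocUniqueStandardizedToponymID
11,Amsteldam,2001
"""

VOC_PLACES_CSV = """uniqueStandardizedToponymID,uniqueStandardizedToponymCountryCode,URI,LAT,LNG
2001,Amsterdam NL,https://example.org/place/2001,52.37,4.90
"""


def index_by(text, key):
    """Read CSV text and index its rows by the given column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}


def birthplace_coordinates(record, attestations, gazetteer):
    """Follow vocop -> vocop_place -> voc_places; None if any step is missing."""
    attestation = attestations.get(record["vocPlaceID"])
    if attestation is None:
        return None
    place = gazetteer.get(attestation["vocUniqueStandardizedToponymID"])
    if place is None or not place["LAT"] or not place["LNG"]:
        return None
    return float(place["LAT"]), float(place["LNG"])


attestations = index_by(VOCOP_PLACE_CSV, "VOC_placeID")
gazetteer = index_by(VOC_PLACES_CSV, "uniqueStandardizedToponymID")

results = {
    row["ID"]: birthplace_coordinates(row, attestations, gazetteer)
    for row in csv.DictReader(io.StringIO(VOCOP_CSV))
}
print(results)
```

Records without a standardized place simply yield None here, so the not yet reconciled attestations remain visible instead of being dropped silently.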