Saturday 30 November 2013

Is That a Fact?



When you record such things as a name, age, occupation, place-of-birth, etc., do you refer to them as ‘facts’ or something else? Are they held as simple text values in your database? Have you thought about the true nature of those data items?

As usual in the digital side of genealogy, we have a plethora of alternative terms for the same thing, and ambiguous interpretations of the more common terms. Genealogists are encouraged to refer to these data items as ‘facts’, although I have already made the point in Evidence and Where to Stick It that their facticity is dependent upon the source from which they came. A number of software developers prefer the term ‘PFACT’, which stands for property, fact, attribute, characteristic, or trait. However, this is squandering five perfectly good words – each with distinct meanings in normal usage – and so reducing the possibility of any of them being given distinct genealogical uses. I will be employing the more generic STEMMA® term of ‘Properties’ in this post.

So, what is a Property? You might say that it is an item of evidence[1] taken from a given source of information. This is a fair description, but as soon as you acknowledge that a Property is “an extracted and summarised item of information” then a number of issues have to be considered and solved for their digital representation. What I’m about to present is my own approach as, to date, I’m not aware of any product that tackles all of these issues.

Foremost amongst the issues – and yet rarely discussed in the context of Properties – is the difference between what was written and your interpretation of it. Although this is a fundamental part of supporting evidence and conclusion (E&C), I should clarify that here it means purely the analysis and interpretation of each item, not building them into any proof argument, which is a separate phase. For instance, if a place name has been misspelled, or is hard to read, then you need to record it as it was written (indicating any uncertain characters) together with your interpretation of what it should have been. In effect, each Property has two distinct values: the recorded one, including any transcription anomalies, and the interpreted one. As with any form of conclusion-making, you’ll also need a way to add any explanatory notes, and possibly add some level of confidence in your result. I will come back to this duality of Properties in a moment.
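
To make the duality concrete, here is a minimal sketch in Python. It is not STEMMA syntax: the class name, the field names, the 0-to-1 confidence scale, and the misspelled place name are all assumptions invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PropertyValue:
    """One Property holding both its recorded and interpreted forms (hypothetical model)."""
    name: str                           # e.g. "BirthPlace"
    recorded: str                       # exactly as written in the source
    interpreted: Optional[str] = None   # what the researcher believes was meant
    note: str = ""                      # explanation of the interpretation
    confidence: Optional[float] = None  # assumed scale: 0.0 (guess) to 1.0 (certain)

    def was_reinterpreted(self) -> bool:
        """True when the interpretation differs from the written form."""
        return self.interpreted is not None and self.interpreted != self.recorded


# Made-up example: a place name misspelled in the source.
birthplace = PropertyValue(
    name="BirthPlace",
    recorded="Notingham",
    interpreted="Nottingham",
    note="Single 't' in the original; taken to be a spelling error.",
    confidence=0.9,
)
```

The recorded form is never overwritten, so a later recipient can still see what the source actually said and judge the interpretation for themselves.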

All Property values are implicitly associated with a particular time and place. For instance, someone’s name may have changed during their life, and someone’s age will certainly have changed over time. STEMMA copes with this because the Properties are associated with specific Event-to-Person connections[2] in the data, and the Event entity implicitly provides a relevant date for the interpretation and applicability of the value.
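
The following Python sketch shows one way that association could be modelled, with the Event supplying the date for every Property attached to a Person's participation in it. The class names, the function, and the sample person are hypothetical and are not taken from STEMMA itself.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Tuple


@dataclass
class Event:
    title: str
    when: date


@dataclass
class PersonEventLink:
    """A Person's connection to an Event, carrying the Properties seen there."""
    person_id: str
    event: Event
    properties: Dict[str, str] = field(default_factory=dict)


def property_timeline(links: List[PersonEventLink], person_id: str,
                      prop: str) -> List[Tuple[date, str]]:
    """Return (event date, value) pairs for one Property, oldest first."""
    rows = [(link.event.when, link.properties[prop])
            for link in links
            if link.person_id == person_id and prop in link.properties]
    return sorted(rows, key=lambda row: row[0])


# Made-up data: the same person appears under two names at two events.
baptism = Event("Baptism", date(1850, 6, 2))
census = Event("Census", date(1881, 4, 3))
links = [
    PersonEventLink("p1", census, {"Name": "Mary Jones", "Age": "30"}),
    PersonEventLink("p1", baptism, {"Name": "Mary Smith"}),
]
timeline = property_timeline(links, "p1", "Name")
```

Because the date lives on the Event rather than on each value, a name or age never needs its own date stamp.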

Another issue to consider is the nature of the Property. Is it the name of something (e.g. a person or place), a description (e.g. cause of death), a date, or a measure of something (e.g. age, height, weight)? This is termed its data-type. The importance of it lies with the interpreted value (rather than the written value) which should be computer-readable in order to make the most use of it. Whilst I acknowledge that there may be detractors to this statement, let me try and make a number of observations to justify it.

For the simple expedient of consistency checking, software needs to know whether a value should be textual, numeric (integer or real), or a date. More than this, though, a value such as a date can be used in a timeline, and an age can be used to derive dates and to separate events, so their values should be accessible to software. In the case of a person or place reference, these can be linked (using some type of pointer mechanism) to the corresponding Person or Place entity in the data. That linkage, which is as much a conclusion as the interpreted value of any date, is required in order to allow you to follow the reference to the entity’s details. However, the duality of the Property values doesn’t require you to change the name from how it was recorded at that time. Finally, in certain cases, a Property may have a representation that doesn’t correspond to a value in the normal sense, either because the written form was undecipherable or it had a special meaning. For instance, the use of “Full Age” for a young married couple, or “Unknown”, “N/A”, or “LNU” for an unknown name, are special non-values. There’s a golden rule that you do not record anything in a name field that isn’t actually a name[3]. Being able to distinguish the recorded form from an interpreted form avoids this issue.
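
As a rough illustration of these points, the Python sketch below gives each Property an expected data-type, an optional link to an entity, and a simple consistency check. All of the names are hypothetical, and the placeholder list contains only the examples mentioned above; a real product would need a more careful treatment.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Union

# Placeholders that must never be stored as an interpreted value (examples from the text).
SPECIAL_NON_VALUES = {"full age", "unknown", "n/a", "lnu"}


@dataclass
class TypedProperty:
    name: str
    data_type: type                   # expected type of the interpreted value
    recorded: str                     # verbatim text from the source
    interpreted: Union[str, int, float, date, None] = None
    target_id: Optional[str] = None   # assumed pointer to a Person or Place entity


def check(prop: TypedProperty) -> List[str]:
    """Return a list of consistency problems; an empty list means none were found."""
    problems = []
    value = prop.interpreted
    if value is not None and not isinstance(value, prop.data_type):
        problems.append(f"{prop.name}: expected {prop.data_type.__name__}, "
                        f"got {type(value).__name__}")
    if isinstance(value, str) and value.strip().lower() in SPECIAL_NON_VALUES:
        problems.append(f"{prop.name}: placeholder {value!r} stored as a value")
    return problems


# "Full Age" is kept as the recorded form; no numeric value is invented for it.
full_age = TypedProperty("Age", int, recorded="Full Age")
# A placeholder wrongly promoted to an interpreted name.
bad_name = TypedProperty("Surname", str, recorded="LNU", interpreted="LNU")
# A place reference linked to an entity without altering the recorded spelling.
place = TypedProperty("BirthPlace", str, recorded="Notingham",
                      interpreted="Nottingham", target_id="place-001")
```

Leaving the interpreted value empty for "Full Age" or "LNU" keeps the special form visible in the transcription without polluting the name or age data that software will process.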

If a Property is a measure of something, such as a height or weight, then the interpreted value needs to identify the units. In all but one case, it is debatable whether or not software will want to make use of these units themselves as opposed to simply distinguishing values held in different units. That exception involves the age of a person. Ages are normally recorded in years, but ages in months, weeks, or even days, are quite common for infant deaths. These may also be fractional rather than integer values, e.g. “3 ½ weeks”.
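
A small Python sketch of an age held with explicit units may help here. The conversion factors are approximations chosen for illustration (a year is taken as 365.25 days and a month as one twelfth of that), and the class, function, and sample dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from fractions import Fraction

# Approximate number of days in each unit (an assumption, not a standard).
DAYS_PER_UNIT = {
    "days": Fraction(1),
    "weeks": Fraction(7),
    "months": Fraction(1461, 48),  # 365.25 / 12
    "years": Fraction(1461, 4),    # 365.25
}


@dataclass(frozen=True)
class Age:
    recorded: str     # verbatim, e.g. "3 ½ weeks"
    amount: Fraction  # interpreted quantity, which may be fractional
    unit: str         # one of the keys of DAYS_PER_UNIT

    def approx_days(self) -> int:
        """Approximate length of this age in whole days."""
        return round(self.amount * DAYS_PER_UNIT[self.unit])


def estimate_birth(event_date: date, age: Age) -> date:
    """Estimate a birth date by counting back from the date of the event."""
    return event_date - timedelta(days=age.approx_days())


# Made-up example: an infant burial with the age recorded in weeks.
infant_age = Age(recorded="3 ½ weeks", amount=Fraction(7, 2), unit="weeks")
estimated = estimate_birth(date(1850, 3, 25), infant_age)
```

An age in years gives only a rough birth date, whereas one in weeks or days narrows it considerably, which is why the units matter to software in this one case.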

Some Properties are necessarily multi-valued. The most obvious case is a Role (i.e. the part a Person plays in an Event). For instance, a witness at a wedding may also have been a relative of either the bride or the groom. A computer representation must accommodate multiple values, and support the duality for each instance.
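
In code, this amounts to holding a list of values rather than a single one, with each entry keeping its own recorded and interpreted forms. The Python sketch below assumes hypothetical class names, identifiers, and role wording.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RoleValue:
    recorded: str                      # as written in the source
    interpreted: Optional[str] = None  # normalised reading, if it differs


@dataclass
class Participation:
    """One Person's part in one Event, which may involve several Roles."""
    person_id: str
    event_id: str
    roles: List[RoleValue] = field(default_factory=list)

    def interpreted_roles(self) -> List[str]:
        """Interpreted form of each Role, falling back to the recorded form."""
        return [r.interpreted or r.recorded for r in self.roles]


# Made-up example: a wedding witness who was also related to the bride.
witness = Participation(
    person_id="p7",
    event_id="wedding-01",
    roles=[
        RoleValue(recorded="Witness"),
        RoleValue(recorded="Bro. of bride", interpreted="Brother of bride"),
    ],
)
```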

It would be folly to try and enumerate all possible Properties in advance of them being used. Different researchers, different sources, and different cultures, may all result in unanticipated Properties having to be recorded. What is required, therefore, is a scheme that allows custom Properties to be freely defined without some onerous, centralised registration process, and yet still allows those custom Properties to be loaded by any compliant product. This is certainly possible but it is such a widespread requirement – applying to many types, subtypes, and other sets of named values – that I plan to write about it separately.

If you’re still with me then you’re probably about to say ‘this is way too complicated, Tony’. Before you finish preparing your response, though, consider these points:

  • We cannot assume that a recipient of your data has access to the same online images, and the T&Cs that you’ve checked probably prohibit you from sharing your images. Also, if you’re one of the minority who still visit archives, etc., then the originals may be locked away, and not copiable or online at all. In other words, our transcriptions can be invaluable. Hence, if we take shortcuts with those transcriptions – even for mere Properties – and assume that we know what the author meant without recording things verbatim (or even literatim), or fail to mention crossings-out and other annotations, then we’re diluting that effort and short-changing some later recipient.
  • Do we want our genealogy products to simply record what we type in? If so then we might as well just use a word-processor. Providing more detail, and making it machine-readable, means that our products can work with the data to provide such things as analysis and consistency checking.


I’ll close by providing some links to a couple of worked examples in STEMMA for any code-junkies: Transcription Anomalies and Census Roles. Between them, these deal with many of the cases discussed here, including transcription anomalies, spelling errors, clarifications, and mis-recorded information.



[1] If anyone wants to comment that the evidence in any given source is more than a set of discrete values then I entirely agree. There is usually much context and information that cannot be distilled down to simple values. What we’re discussing here is just the digested pieces of information that many genealogists store in their databases, but also acknowledging that this alone is not fully representative.
[2] For historical references to places, the corresponding STEMMA Properties would be associated with Event-to-Place connections.
[3] This issue is covered in excellent detail by Tamura Jones, “FNU LNU MNU UNK”, Modern Software Experience, 11 Aug 2013 (http://www.tamurajones.net/FNULNUMNUUNK.xhtml : accessed 22 Nov 2013); also his previous works: “The Lnu Family Mystery”, Modern Software Experience, 11 Aug 2013 (http://www.tamurajones.net/TheLnuFamilyMystery.xhtml : accessed 22 Nov 2013); “Unk is a Real Name”, Modern Software Experience, 10 Aug 2013 (http://www.tamurajones.net/UnkIsARealName.xhtml : accessed 22 Nov 2013).
