Some people eat, sleep and chew gum, I do genealogy and write...

Sunday, October 30, 2016

Can Indexing of Historical Records be Done by Computer Programs?


In a recent blog post from FamilySearch.org entitled, "What’s New on FamilySearch—October 2016," there is an announcement that, in conjunction with GenealogyBank.com, an obituary collection in the FamilySearch.org Historical Record Collections was partially computer indexed. Indexing as done by FamilySearch is a relatively labor-intensive activity. Every document is manually indexed by two separate indexers and then subject to "arbitration" by a third person who reviews the indexing done by the other two. Presently, there is a huge backlog of indexed records that need to be "arbitrated." Additionally, because of the pace of digitization, there are many more digitized documents than there are indexed ones.
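To make the two-indexers-plus-arbitration idea concrete, here is a minimal Python sketch of how two independent transcriptions of one record could be compared so that only the fields where they disagree go to a third person. This is an illustration only, not FamilySearch's actual software; the function names and the field names (`surname`, `given`, `death_year`) are hypothetical.

```python
def normalize(value):
    """Ignore differences in case and spacing; treat a missing field as empty."""
    if value is None:
        return ""
    return " ".join(str(value).split()).casefold()


def fields_needing_arbitration(index_a, index_b):
    """Return the field names where two independent transcriptions disagree."""
    all_fields = sorted(set(index_a) | set(index_b))
    return [
        field
        for field in all_fields
        if normalize(index_a.get(field)) != normalize(index_b.get(field))
    ]


# Two hypothetical indexers transcribe the same obituary.
indexer_a = {"surname": "Tanner", "given": "James ", "death_year": "1916"}
indexer_b = {"surname": "Tanner", "given": "james", "death_year": "1918"}

# Only the disputed field would need a third person's attention.
disputed = fields_needing_arbitration(indexer_a, indexer_b)
print(disputed)
```

In a scheme like this, the arbitrator's workload depends on how often the two indexers disagree, which is one reason hard-to-read handwriting slows everything down.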

If we assume that indexing is "necessary" rather than a convenience, then the task of indexing all of the billions of records waiting to be indexed might seem overwhelmingly difficult. As I see it, there are two main possibilities. We can dramatically increase the number of individual indexers and arbitrators, or we can employ some already existing data processing techniques. The big obstacle to using computers to index many historical records is rather simple: handwriting. Only very recent historical records are in a printed or typewritten format that makes them subject to optical character recognition and therefore subject to computerized indexing.

Optical character recognition, or OCR, has come a long way from its initial introduction in the early 1900s to transcribe text for people who were blind. Today, we have millions upon millions of online ebooks that have been generated by optical character recognition programs. Meanwhile, reading handwriting with OCR remains one of the not yet fully developed and much sought-after goals of computerized indexing efforts. Handwriting recognition programs are commonly used, mostly with human backup, in U.S. Mail distribution and many other industries. The challenge posed by historical records is mainly really bad handwriting and faded documents.

Any genealogist who has done research into old, handwritten documents can attest to the difficulty of deciphering old handwriting. Hence, the army of human indexers. But there is no reason why any printed or typewritten genealogically significant documents could not be indexed, perhaps with a human review, by computer programs. In fact, any machine-readable document should be entirely read and rendered searchable by any word or character string in the document. FamilySearch.org has done just that with its online Books collection, which currently has over 312,000 digitized books online.
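For readers curious about what "rendered searchable by any word" means in practice, the following is a minimal Python sketch of a word index built from OCR text. It assumes the OCR step has already produced plain text for each page; the page identifiers and sample text are made up, and this is not a description of how FamilySearch.org actually builds its Books search. It also indexes whole words only; searching for arbitrary character strings inside words would need a different structure.

```python
import re
from collections import defaultdict

WORD_PATTERN = re.compile(r"[a-z0-9']+")


def build_index(pages):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in WORD_PATTERN.findall(text.lower()):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the page ids that contain every word in the query."""
    words = WORD_PATTERN.findall(query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical OCR output for two pages of a digitized book.
pages = {
    "p1": "John Tanner died in Arizona",
    "p2": "Mary Tanner of Utah",
}
index = build_index(pages)
print(search(index, "Tanner Utah"))
```

Because the index is built once and then reused, every word on every page becomes searchable without a person having to decide in advance which names or dates matter.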

I might note that, for some strange reason, the digitized books on FamilySearch.org are not included in the searches made in the Historical Record Collections, so the two different collections must be searched separately.

I am aware of significant efforts being made in handwriting recognition, but this area is still not to the point where it will replace much of the human effort involved. Meanwhile, I would suggest that computer-aided OCR can make a significant advance in indexing printed and typewritten documents and that this should be done and the results made available to the genealogical community one way or another.


1 comment:

  1. Great writeup on the advances of indexing. It will be so interesting to see how indexing evolves.
