[Opengenalliance] English school records project

Ben Brumfield benwbrum at gmail.com
Wed Apr 27 03:52:32 BST 2011


Javier (et al.),

The proposal was just a bit of a brain dump based on reading the TNA
proposal, but I'm glad if it helps.

I see no reason why any tool for an indexing/transcription project
shouldn't be open-source.  If you end up getting funding and putting
it out for bidders, you can certainly make a F/OSS license--I suggest
the AGPL for this--a requirement.

You asked my opinion of Scripto.  I've followed its development quite
closely, and I'm afraid that, like the other F/OSS transcription
solutions released so far, it's not appropriate for a
project entering structured data.  This is no mark against Scripto,
however, but a problem inherent in all four transcription tools that
are backed by wikis (Wikisource/ProofreadPage, FromThePage, Bentham
Transcription Desk, and Scripto).

The issue is that each of our tools was designed to enable
transcription (and sometimes indexing, annotation, proof-reading and
review) of free-form text materials like diaries and letters.  What
you need for the school records project--and likely any other project
of primarily genealogical interest--is a system that allows users to
enter fields of each record directly into a database that can be
searched quickly and intelligently.  Visually you need something akin
to a grid control -- a spreadsheet-style entry screen with fields for
each of Student First Name, Student Surname, Parent First Name, Parent
Last Name, Enrollment Date, Departure Date, and whatever fields are
specific to a set of records.  FamilySearch Indexing does this well
with structured data originating from tabular manuscript forms, and
the North American Bird Phenology Program does this well (albeit on a
single-record-per-image basis) with structured data represented on
heterogeneous manuscript materials.  This structure applies to the
database as well, since researchers are going to want to search on
those fields, and you'll want those searches to perform well.  While
it's possible to extract structured data from a wiki (see dbpedia for
an example), the transcribers would be forced to add their own mark-up
to tag each word with the type of data it represented, which is a
significant usability problem and an invitation to corrupt data.
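To make the grid-and-fields idea concrete, here is a minimal sketch in
Python with SQLite.  The column names simply follow the examples above;
they are illustrative, not a real schema from any of these projects:

```python
import sqlite3

# Illustrative field-per-column schema; column names follow the examples
# above (student/parent names, enrollment and departure dates).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE school_record (
        id INTEGER PRIMARY KEY,
        student_first_name TEXT,
        student_surname    TEXT,
        parent_first_name  TEXT,
        parent_last_name   TEXT,
        enrollment_date    TEXT,   -- ISO 8601 so range queries sort correctly
        departure_date     TEXT,
        source_image       TEXT    -- link back to the scanned page
    )
""")
# Indexes on the fields researchers actually search keep those lookups fast.
conn.execute("CREATE INDEX idx_surname ON school_record (student_surname)")
conn.execute("CREATE INDEX idx_enrolled ON school_record (enrollment_date)")

conn.execute(
    "INSERT INTO school_record (student_first_name, student_surname,"
    " enrollment_date) VALUES (?, ?, ?)",
    ("Mary", "Smith", "1872-09-01"),
)
rows = conn.execute(
    "SELECT student_first_name FROM school_record WHERE student_surname = ?",
    ("Smith",),
).fetchall()
print(rows)  # [('Mary',)]
```

The point is that each datum lands in its own typed, indexed column from
the moment of entry -- no wiki mark-up to parse, nothing to corrupt.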

While applications exist for structured data entry (FamilySearch
Indexing, BPP's tool, OldWeather's tool, Ancestry's World Archives
Project) or for manuscript tagging (BYU's Historic Journals), I don't
think that any of them are F/OSS.  However, I have not been in contact
with most of them, and haven't followed that niche as closely as the
free-form niche so you may discover something different.

Ben Brumfield
http://manuscripttranscription.blogspot.com/
http://github.com/benwbrum/fromthepage/wiki


On Tue, Apr 26, 2011 at 10:14 AM, Javier Ruiz
<javier at openrightsgroup.org> wrote:
> Hi Ben
>
> I am ccing this to the Open Ge list to broaden the discussions.
>
> You are right about the difficulties. The TNA more or less see the LIA as
> income, but more importantly they care about not having to invest anything
> and be able to claim to have moved materials online (just don't mention the
> pay wall). There is a very good report on how it works here:
>
> http://www.ithaka.org/ithaka-s-r/research/ithaka-case-studies-in-sustainability/case-studies/SCA_BMS_CaseStudy_NatArchives.pdf/view
>
> We are arranging a meeting with their head of digitisation, who was in the
> conference last week. We want to challenge their commercial approach but in
> a positive and friendly manner with a solution based on participation. They
> are actually very interested in crowdsourcing, but in parallel not as an
> alternative to LIA.
>
> You have done amazing work researching the options for participatory
> digitisation of the schools registers, thanks so much. Even if this
> particular project does not happen in the end because of vested interests or
> just lack of time /cash to organise it for the bid deadline, the development
> of an open digitisation workflow is a very important element of opening up
> the data. The conference last week was looking at upgrading some tools for
> specific projects, so something will happen for sure.
>
> We will contact The Internet Archive, although I am not sure they can help
> with the scanning directly if we are in UK. Our initial view was that we
> could build the scanning beds quite easily, as I sent you in another email.
>
> The hosting of the images and the API looks more interesting. There are a
> few people in this list with connections to web resources, such as
> Wikimedia, and hopefully they will chip in some ideas.
>
> The indexing / transcription tool is where I think there is more work to do
> and where you really have some unique insights. We hoped to build on and
> develop under open source as much as possible so the tools remain accessible
> for other projects and groups, rather than licensing for a specific project.
> Of course there are costs associated with this approach as well, and
> it's great to have an initial estimate of effort. What is your view of
> scripto.org? Do you think it could be a good basis for structured texts?
>
> We will not outsource the actual transcription as participation is a key
> element throughout, although there are costs involved in
> training, particularly for scanning.
>
> The actual serving of the records has two sides. On the one hand people need
> genealogical data integrated in searches and delivered consistently, rather
> than scattered across zillions of websites, and this is one area I hope the
> OGA can really make a difference one day. On the other hand, you could
> develop a very rich experience of the particular record collection, schools
> in this case, with extra material and an immersive experience possibly with
> photos, lesson plans, stories by pupils, etc. It's not clear how much the
> project would need to deliver on the public interface.
>
> I will let you know how the meeting with GalaxyZoo goes and also about
> concrete tool development plans by some partners.
>
> best, javier
>
[re-appending my original email, which Javier intended to quote in his
original CC --BWB 2011-04-26]
On 21 Apr 2011 15:27, "Ben Brumfield" <benwbrum at gmail.com> wrote:

Javier,

I just wanted to follow up on our conversation yesterday.  I think you
were planning to email me some more information about your group and
your thoughts on the project, but if you did, I'm afraid that I didn't
receive anything -- could you try again?

I've done a bit more research on TNA's LIA program, and I must say
that it looks like you're battling a pretty entrenched way of doing
things.  The LIA program's mandatory requirement that licensees
implement a "'pay-per-view' charging model" indicates to me that
whoever drafted the LIA program sees it primarily as a revenue source
for TNA -- a priority that runs directly counter to your goals.

That said, you asked me for my advice on tools and methodologies, and
absent any other information about your existing plans, I'll just riff
on the TNA's posting.

Scanning:
I'd work with the Internet Archive to scan the material -- they're
cost-effective, proven, and open.  Nevertheless, this will likely
require substantial funding.

Image Hosting:
The great benefit of the Internet Archive is that their Archive.org
software and site will host content like this for free.  They're
already hosting the US Census:
http://www.archive.org/details/us_census , and while copyright on
government material is different (in lamentable ways) in the UK, this
might well be right up their alley.

Aside from cost, the second great benefit to using Archive.org is
their commitment to openness -- they have an extensive API that allows
third parties to build applications on top of the content they host.
My own FromThePage software embeds their BookReader to serve scanned
page images, but I'm only scratching the surface of what they can do.
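For anyone wiring up such an integration, the sketch below shows the
general shape of reading an item's file list from Archive.org's metadata
endpoint.  The endpoint path is real, but the sample response, the format
strings, and the helper names are trimmed and illustrative assumptions:

```python
# Assumed helper names; only the /metadata/<identifier> path is Archive.org's.
def metadata_url(identifier):
    """Archive.org's read-only JSON metadata endpoint for an item."""
    return "https://archive.org/metadata/" + identifier

def image_files(item_metadata):
    """Pick out likely page-image files from an item's 'files' list.
    The format strings here are examples, not an exhaustive list."""
    wanted = ("JPEG", "Single Page Processed JP2 ZIP")
    return [f["name"] for f in item_metadata.get("files", [])
            if f.get("format") in wanted]

# Trimmed, invented response in the shape the endpoint returns; a live
# call would be json.load(urllib.request.urlopen(metadata_url("us_census"))).
sample = {"files": [
    {"name": "page_0001.jp2.zip", "format": "Single Page Processed JP2 ZIP"},
    {"name": "item_meta.xml", "format": "Metadata"},
]}
print(metadata_url("us_census"))  # https://archive.org/metadata/us_census
print(image_files(sample))        # ['page_0001.jp2.zip']
```

An indexing tool could fetch this once per item and drive its page
navigation from the file list rather than hosting images itself.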

Indexing Tool:
Enabling your network of volunteers to index the data requires a tool
capable of dealing with heterogeneous structured data.  To my
knowledge, no such tool is publicly available, although iArchives
might be willing to re-sell their FamilySearch Indexing tool, and the
Zooniverse folks have written that their excellent OldWeather tool
should be re-usable.  (You wrote that you were meeting them soon, and
I'll be very interested to know how that goes -- they're a great
outfit!)  The Amsterdam City Archives got quite a response to their
VeleHanden RFP, so you might have good luck putting the tool out for
bid -- in fact I've read one of the unsuccessful proposals and was
impressed by its quality.

Even an existing tool is going to require some modifications, however,
and your choice of tool will have implications for your scanning
strategy.  (FamilySearch hosts all their own images, so a loosely
coupled integration with Archive.org won't work; OldWeather may well
have their own ways of doing things.)  If I were doing
this myself, I'd build a tool that displays images and metadata (page
order, etc) from the Internet Archive with a data-entry grid control.
Because the records are so diverse, you're going to want to decide
whether you're just transcribing information of bare genealogical
relevance (student name, parent name) or whether you're transcribing
the entire record.  The latter approach--which I prefer--implies data
entry fields that are different for each record set, with a
flexible/extensible data model to match.
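Here is a rough illustration of what such a flexible/extensible model
could look like: each record set declares its own fields, and entries are
stored in an entity-attribute-value layout so no schema change is needed
per record set.  All names here are hypothetical:

```python
from collections import defaultdict

# Hypothetical per-record-set field declarations: the data-entry grid and
# the storage layer both adapt to whichever set is being transcribed.
FIELD_DEFS = {
    "admission_register": ["student_first_name", "student_surname",
                           "parent_first_name", "enrollment_date"],
    "punishment_book":    ["student_surname", "date", "offence"],
}

def validate_entry(record_set, entry):
    """Reject fields the record set doesn't declare; missing fields are
    allowed, since transcribers often face blank or damaged cells."""
    allowed = set(FIELD_DEFS[record_set])
    unknown = set(entry) - allowed
    if unknown:
        raise ValueError("unknown fields: %s" % sorted(unknown))
    return entry

# Entries keyed by (record_set, record_id) -- an entity-attribute-value
# layout: add a new record set by adding a FIELD_DEFS entry, nothing more.
store = defaultdict(dict)
entry = validate_entry("punishment_book",
                       {"student_surname": "Jones", "offence": "lateness"})
store[("punishment_book", 1)].update(entry)
print(store[("punishment_book", 1)]["offence"])  # lateness
```

The trade-off is that searches across record sets have to cope with
fields that only some sets define, which the search layer must hide.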

All of this tool work--whether you modify something already built or
create something from scratch--requires funding: I'm guessing you'd
need to pay a programmer for approximately a year's work and a
designer for about four months'.

Indexing Operations:
You'll need to host the tool somewhere--a task that keeps getting
cheaper--and manage the community of people doing the
transcription.  If you end up outsourcing the transcription via
Mechanical Turk, you'll need some additional funds to pay the
indexers, not to mention some further technical modifications to the
tool to make that maintainable.  You'll also need more development
effort for bug fixes and enhancements based on the tool's contact with
actual users.

Record Search/Display:
Aside from the indexing tool, you're going to need a tool for people
to use for searching and viewing the indexed records.  This is
substantially simpler than the indexing tool, although it will likely
require more design and scalability work if it gets a lot of use.
Fortunately this may also leverage the Internet Archive for record
display.
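To show how light the search side can be next to the indexing tool, here
is a toy sketch: an inverted index over already-transcribed entries with
an optional enrollment-year filter.  The records and field names are
invented for the illustration:

```python
# Invented sample of already-transcribed entries.
records = [
    {"id": 1, "student_surname": "Smith", "enrollment_date": "1872-09-01"},
    {"id": 2, "student_surname": "Smith", "enrollment_date": "1884-01-12"},
    {"id": 3, "student_surname": "Jones", "enrollment_date": "1872-09-01"},
]

# Inverted index from lower-cased surname to record ids for fast lookup.
by_surname = {}
for rec in records:
    by_surname.setdefault(rec["student_surname"].lower(), []).append(rec["id"])

def search(surname, year=None):
    """Surname lookup with an optional enrollment-year filter."""
    ids = set(by_surname.get(surname.lower(), []))
    hits = [r for r in records if r["id"] in ids]
    if year is not None:
        hits = [r for r in hits if r["enrollment_date"].startswith(str(year))]
    return [r["id"] for r in hits]

print(search("smith", 1872))  # [1]
```

In production this would be the database's job (or a search engine's),
but the shape of the feature is exactly this small.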

Those are all the pieces I can predict, but there may well be more
pieces involved.  Feel free to send me any other information you have
on the proposal to date -- I'm very interested in tracking this
project.

Best of luck!

Ben Brumfield
http://manuscripttranscription.blogspot.com/
http://github.com/benwbrum/fromthepage/wiki


