Bootstrapping digital humanities support services for small teams
Introduction
This paper is partly devoted to pointing out several issues confronting eResearch, via the example of two humanities researchers, but it also offers a potential solution to one of the major data management issues confronting the researchers: how to preserve their analysis of an image corpus, how to hand the images on to the university for safe-keeping, and how they might share their corpus more broadly.
In this paper the pronoun 'we' refers to the broad project of eResearch being undertaken by Austin and Hickey, in association with Peter Sefton and the team of software developers in the Australian Digital Futures Institute. 'I' refers to the author, Sefton. Discussions with the researchers are reported in the text without continually citing them as personal communication.
Current tools and methodology
The researchers in the Signing the School study describe their methodology:
We have utilized a variety of photographic equipment, ranging from hard drive digital video cameras to high quality digital still cameras to mobile phones. The opportunities for powerful visual data gathering (and, by implication, more authentic visual ethnographic work) resident within new forms of communication technologies have become very evident to us in the conduct of this project1. Images were captured, transferred to iMac computers and the analysis conducted using Nvivo software. The data are sifted across two warps: one, looking for broad categories of message type (ie. commonalities across the school sites); the second cut of data involves looking for threads of development from individual school sites across time (mapping the development of a school’s image as conveyed to its community).[1]
The Nvivo software used for analysis is a kind of Computer-Assisted Qualitative Data Analysis Software (CAQDAS); for links to various packages and a discussion of their advantages and shortcomings see the site maintained by Loughborough University [2].
For the researchers in the study this software is considered to be the industry standard. Austin has been using the program since its genesis as NUDIST [3]. The program started out as a cross-platform Windows and Macintosh application, but with its change to the new name Nvivo it became a Windows-only proposition. This narrowing of platforms has implications for data preservation, discussed below.
Members of the Education Theory Collective (ETC) approach the tool as a general aid to data management. The tool allows them to sort and view data using coding schemes of their own devising (which we might call ontologies in the computer science sense), but they do not attempt to use it for very fine-grained analysis. Austin explains that they are not 'pretending' to be scientific; they are pursuing a program of work that is more along the lines of trying to have the reader "try on someone else's subjectivity". [TODO- Jon do you have a reference for that?]
They also see the tool as not worthwhile for small projects, because of the overhead of setting it up.
The researchers are aware of other applications in this space, and of the fact that data are not portable between them. But the main portability issue, given that Nvivo is considered standard in their milieu, is that it is difficult to collaborate with students, who may not be able to afford the software. Austin suggested in the discussions for this article that the university might be able to provide a licence-loan system, which led to a discussion of free software options.
Neither of the researchers is (yet) politicised with respect to the term free software. When I used the term "Free Software" in conversation, Austin was quick to recognise the benefits of free software to an inclusive sociological or ethnographic enterprise, himself referring to the benefits of 'freeware'. Choosing the term freeware is an indicator that the often-made distinction between free-as-in-gratis (or free-as-in-beer) and free-as-in-libre [4] is not part of the research group's discourse, and yet their aspirations to an inclusive ethnographic practice [TODO: J&A what's a good paper to cite here?] are compatible with the aims of the Free Software and open source movements, quite apart from the practical benefits afforded by software with no licensing costs.
Another area where proprietary software licenses are a major issue is data preservation. The Nvivo software uses a proprietary data format, and our eResearchers have not explored any of the available export formats.
Even leaving aside the issues of data portability, the ETC collective have no formal method of archiving their research data, and the university does not supply one or have any systematic processes for encouraging researchers to develop their own. There is an ePrints repository where research outcomes such as papers and theses can be deposited, but it has not been used for storing data at this stage. We discuss some of the possibilities for depositing these data below.
The paper is in the ePrints repository, but it did not magically appear there; it is important to consider how it got there. The authors wrote their paper in Microsoft Word [J&A – do you use EndNote?] using a journal template. When the paper was published the authors submitted it to ePrints. [TODO: who did it? self submission? Did the journal send you the PDF and explain their licensing?]
Obviously the use of Word is not discussed in the methodology. One would not expect to see a quote like this:
We wrote up our findings using Microsoft Word – using the template supplied by the Journal.
(Quotation fabricated by the author)
Unlike the Nvivo software, the word processor is considered to be irrelevant to the study, and yet the expectations of what can and should be achieved for a publication are set by the tools used to write up and disseminate research. The impact of the use of Word as a tool for eResearch reporting, and the potential for vastly improving it, are discussed in more detail in another paper I submitted to the same conference as this one [5]; here we will consider some of the particulars of the Signing the School paper.
Hickey has commented that it is a shame that the images used in his studies cannot be viewed and explored at full screen resolution when published. The few images that did make it into the paper are reproduced in the PDF at only one resolution. We discuss below some further steps the researchers could take to broaden the scope of their publications using web technologies, as well as what might be offered in the way of support to visual ethnographers in the future.
The written paper also has one other dimension that is of interest. The journal is published by Common Ground Publishing, which is also the vendor of the Common Ground publishing system. It is implied by the journal website, and the instructions offered to authors, that the article is published using said publishing system. The brochure available at the time of writing [6] makes some claims regarding the benefits of the system which seem to be focussed on some use other than that offered to our researchers.
The website claims the system has:
A behind-the-scenes, standards-based system for capturing metadata (creator, title, publisher, date ... and more than a thousand other alternatives), which means that a published document can go into an online bookstore, an electronic library cataloguing record or be visible through web syndication (see the description of our Technology). [7]
More substantial claims about the system are made in a paper which describes a meta-schema capable of acting as an interlanguage between disparate ontologies and schemas [8], which is the subject of a patent application.
The instructions in the template contain some advice to authors which may impact on the preservation of the document.
They are asked not to include metadata in the manuscript, as this information is added in the Common Ground system.
They are asked to observe some rules, with instructions such as "Make sure that you do not remove the section breaks (like the one above '[Body of Paper Begins Here]')." In fact, confusingly, the only such item in square brackets in the template says: "[Body of Paper Begins Here – Delete This Heading]".
They are subjected to some apparently arbitrary limits on what kinds of word processing features they may use, including a prohibition on using fields – a mechanism commonly used by bibliographic software to embed citation data.
All of the above might be considered a worthwhile price to pay if the publishing system used by the journal added some technical value in addition to the invaluable peer review and editorial services offered by the journal. In this case, however, the value is difficult to determine. The omitted metadata are not present in the properties of the PDF file, and the layout, which has apparently been done by a Free Software component [9], is significantly worse than the authors themselves would have been able to produce using Microsoft Word.
There may be, somewhere, a version of this paper marked up in the Common Ground Markup Language, which can be transformed automatically to virtually any XML format via the patent-pending interlanguage. But if there is, it is of no practical use in this case. The 'behind-the-scenes, standards-based system' for collecting metadata remains just that: behind the scenes.
So despite the initial promise of a publisher website that talks about rich semantic markup, what we are left with is typical of humanities publishing.
Our authors have:
A manuscript in Word sans metadata.
An apparently automatically generated PDF file, with no HTML version and no way to do even simple publishing tasks such as linking to high-resolution versions of their source images.
The original images stored on disk, also sans metadata.
The analysis in the Nvivo software format.
A license from the publisher which permits them to deposit one copy of the journal-generated PDF version of the work at a university website, and some rights for re-use in book publications.
One proposal
In this section we will look at a couple of options to improve the prospects for preserving and disseminating the research data from Signing the School:
A simple technique for improving the prospects for preservation of this work in one specific area: the all-important image data. (For a general treatment of the opportunities for word processing documents see the companion paper to this one [submitted to the eResearch Australasia 2008 main conference] [5].)
Opportunities for disseminating the work outside of the academy, with the attendant risks and impacts on future research.
The all-important images: a pragmatic approach to preservation
[NOTE: This will be accompanied by screenshots and possibly screencast footage of the processes described here in action for a demonstration at the workshop.]
The article itself is mainly text, and the text is recoverable from the PDF version stored in the institutional repository, even if it has to be re-typed. The image corpus, on the other hand, is not currently subject to any formal preservation strategy.
Here we outline a three-part approach which can be implemented now at the University of Southern Queensland. All of these steps are taken with a view to creating a scalable, reproducible process.
Start storing the analytical 'codes' assigned to the items in the corpus as part of the image metadata and choose appropriate tools to accomplish this.
Negotiate with the ePrints repository maintainers to deposit a packaged snapshot of their research data, either in the repository or in some other managed data store. There is precedent for this in Australia in the ARROW-sponsored work at Monash University, where there have been trial deposits of crystallography data into the institutional repository [10].
Set up a web portal that allows exploration of the data, both for the researchers and their colleagues and for a broader public, building on work at USQ that has been sponsored by the ARROW project [11].
A proposal for ad-hoc tagging
One of the key things we hope to achieve with this corpus is to make sure that the analysis, in terms of the researchers' own ontology, is preserved with the photographs. To this end, we have begun inserting the coding into the images themselves, using hierarchical 'tags' embedded in the image files.
The approach taken is to use the established practice of embedding structured keywords in the IPTC [12] metadata that is stored in an image file alongside its EXIF [13] metadata.
This approach allows XMP [14] metadata to be generated automatically. For example:
TODO: show a dump of XMP from a StS mockup.
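In the meantime, the following minimal sketch shows how such a code might be embedded and the resulting XMP inspected, assuming the widely available ExifTool command-line utility; the file name and the specific message-type code are hypothetical, and in practice the researchers apply tags through Digikam's interface rather than a script.

    import subprocess

    image = "school_sign_001.jpg"  # hypothetical file name

    # Add a hierarchical code to the image's IPTC keywords.
    # ('Welcome' stands in for one of the researchers' message-type codes.)
    subprocess.run(
        ["exiftool",
         "-IPTC:Keywords+=/What/Project/Signing_The_School/MessageType/Welcome",
         image],
        check=True)

    # Copy the IPTC keywords into XMP (dc:subject), then dump the XMP packet.
    subprocess.run(["exiftool", "-XMP:Subject<IPTC:Keywords", image], check=True)
    subprocess.run(["exiftool", "-xmp", "-b", image], check=True)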
Initial trials have used the open source program Digikam on an Apple MacBook Pro, running both on a virtual Ubuntu Linux and natively on Mac OS X using X Windows. But the approach is designed to work with other software which can manage tags (preferably with hierarchies) stored in IPTC metadata, with XMP compatibility promised for a future release.
Other software which uses the same protocols for hierarchical tagging includes Mapivi (open source, cross-platform), Microsoft's Photo tools (TODO: get this installed – they have changed the name of some components) and the in-built metadata management in Windows Vista.
One of the challenges for this project is finding software that will run natively on Mac OS X. The most obvious choice, Apple's consumer photo software iPhoto, is not suitable for this kind of research as it does not store tag or caption data back into images. So far we have not been able to find a satisfactory native OS X application, so the researchers may be forced to use Windows on Mac hardware, either running natively or virtually, as Microsoft offers some photographic software which does appear to store metadata back into images.
This tagging approach should be extensible and reproducible for different research teams, as it involves commodity software and simple protocols. And while the tagging system is unsophisticated and un-standardised, it is important to bear in mind that researchers using CAQDAS software are very frequently inventing their own coding ontologies which are not only not standardised, they are not even being preserved, and have no means of being exposed or disseminated as they are locked up in proprietary software.
In the case of this project, the coding-oriented tags would look like this:
/What/Project/Signing_The_School/MessageType/x, where x is one of the codes used to classify the messages.
The researchers do not have to type these long-winded tags; the software provides a tree interface.
[TODO: screenshot of digikam's tagging interface]
A further development, which is not relevant to Signing the School in its current form, is that geographical information can be used to locate photographs, or the standpoint of the photographer, via geo-tagging [15]: embedding geographical metadata in images. This can be accomplished via:
Cameras with embedded GPS, which as of mid-2008 are not common.
Synchronizing a GPS track log with image timestamps (a minimal sketch of this matching step follows below).
Manually locating images on a map.
Both of the latter are becoming common in image manipulation software.
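To make the second of these options concrete, here is a minimal sketch of the timestamp-matching step, assuming the track log has already been parsed into (time, latitude, longitude) tuples; the function name, the example coordinates and the timestamps are all hypothetical, and in practice tools with built-in geolocation support would do this matching for you.

    from datetime import datetime

    # A GPS track log parsed into (timestamp, latitude, longitude) tuples.
    track = [
        (datetime(2008, 6, 12, 9, 30), -27.5606, 151.9539),
        (datetime(2008, 6, 12, 9, 45), -27.5611, 151.9553),
        (datetime(2008, 6, 12, 10, 0), -27.5630, 151.9570),
    ]

    def locate(image_time, track):
        """Return the track point closest in time to an image's timestamp."""
        return min(track, key=lambda point: abs(point[0] - image_time))

    # Look up the camera position for a photo taken at 9:50 am.
    when, lat, lon = locate(datetime(2008, 6, 12, 9, 50), track)
    print(lat, lon)  # coordinates that could then be embedded as EXIF GPS tags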
In the case of this study the researchers are protecting the identity of the schools, and so might choose to remove identifying geotags from images disseminated to the public.
However, there is still a location code which can be assigned to each image:
/Where/Australia/Queensland/Toowoomba/School_a
Once the images are tagged, a process that will take some time, there are two further considerations: firstly, how to keep them for the institution, and secondly, how to develop a way for others to explore them.
[TODO: show screenshot with transcript embedded as photo caption]
If the images are to be published then current practice is to blur out the parts of the school sign which easily identify the institution; this blurred version is saved as a duplicate image.
[TODO: show screenshot of blurred image using Digikam. TODO: show how rights metadata can be embedded]
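As an illustration of the duplicate-and-blur step, here is a minimal sketch using the Pillow imaging library for Python; the file names and the coordinates of the identifying region are hypothetical, and in practice this would be done in an image editor such as Digikam. Note that a real workflow would also need to carry the embedded keyword and rights metadata across to the duplicate, which this sketch does not do.

    from PIL import Image, ImageFilter

    im = Image.open("school_sign_001.jpg")  # hypothetical source image

    # Hypothetical bounding box around the identifying part of the sign.
    box = (120, 80, 420, 160)
    blurred = im.crop(box).filter(ImageFilter.GaussianBlur(radius=12))
    im.paste(blurred, box)

    # Save the result as a duplicate, leaving the original untouched.
    im.save("school_sign_001_public.jpg")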
Bundled deposit
One possibility we are discussing with the USQ repositarian is to simply zip the image collection into a single file and deposit it alongside one of the research outputs in the ePrints software. ePrints has no formal content model for datastreams, but this would clearly associate the data with an article, and future preservation efforts and migrations would find the images, complete with embedded keywords and captions.
An improvement over the simple zip approach would be to provide an OAI-ORE resource map of the image collection, with automatically extracted metadata.
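A minimal sketch of what producing such a package might involve, assuming a flat directory of JPEG files and a bare-bones RDF/XML serialisation of the resource map; the base URI is a hypothetical placeholder, and a production deposit would extract richer metadata from each image than this does.

    import glob
    import zipfile

    BASE = "http://example.edu/sts"  # hypothetical base URI for the package
    images = sorted(glob.glob("corpus/*.jpg"))

    # Bundle the image corpus into a single zip file for deposit.
    with zipfile.ZipFile("sts-corpus.zip", "w") as z:
        for path in images:
            z.write(path)

    # A very simple OAI-ORE resource map in RDF/XML, listing each image
    # as an aggregated resource.
    aggregates = "\n".join(
        '  <ore:aggregates rdf:resource="%s/%s"/>' % (BASE, p) for p in images)
    resource_map = (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
        '         xmlns:ore="http://www.openarchives.org/ore/terms/">\n'
        ' <ore:Aggregation rdf:about="%s/aggregation">\n%s\n'
        ' </ore:Aggregation>\n</rdf:RDF>\n' % (BASE, aggregates))

    with open("sts-resource-map.rdf", "w") as f:
        f.write(resource_map)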
Having the institution accept a zip file of images into an ePrints system would be an improvement over the current support that the ETC researchers are enjoying. But it does not help to disseminate the images more widely, or open opportunities to link publications to the image corpus. It would not be appropriate to add each image as a separate item in the ePrints system, but it might be useful to be able to display them on an image portal, or even on the researchers' own web site.
The Software Research and Development group in the Australian Digital Futures Institute at USQ has been working on a repository portal which is capable of harvesting and indexing the content of other repositories into a content registry, from which administrators can create special-purpose portals.
As part of the TheOREM-ICE project the development team at USQ is developing a demonstration of repository deposit using the SWORD protocol, with an OAI-ORE resource map to describe the content.
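For illustration, a minimal sketch of what a SWORD (version 1.x) deposit of the zipped corpus might look like as a plain HTTP POST; the collection URL, credentials and packaging identifier are hypothetical placeholders rather than the actual TheOREM-ICE configuration.

    import base64
    import urllib.request

    # Hypothetical SWORD collection endpoint and credentials.
    url = "http://repository.example.edu/sword/deposit/sts"
    auth = base64.b64encode(b"user:password").decode()

    with open("sts-corpus.zip", "rb") as f:
        payload = f.read()

    request = urllib.request.Request(url, data=payload, headers={
        "Content-Type": "application/zip",
        "Content-Disposition": "filename=sts-corpus.zip",
        # A packaging profile URI agreed with the repository (hypothetical).
        "X-Packaging": "http://example.edu/packaging/sts-zip",
        "Authorization": "Basic " + auth,
    })

    # On success the server responds with an Atom entry describing the deposit.
    response = urllib.request.urlopen(request)
    print(response.getcode())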
The portal
The design of the metadata tagging described above supports two main uses:
Providing the metadata that will allow certain images to be separated out into their own portals.
Useful exploration across image collections using the same tagging scheme.
Figure 1: Repository architecture showing two potential deposit paths

To support the first option, the SoF software can create a 'Signing the School' portal by showing only images tagged as belonging to the project. [TODO: Insert screenshot of the SoF admin screen showing a filtering to only show images that have a keyword /what/project/Signing+The+School]
By using a structured tagging system, it is possible to map keyword tagging to multiple index fields. For example, Table 1 shows the mapping currently being tested with the Signing The School image data. The effectiveness of this technique will need to be evaluated when data from a number of related and unrelated projects are made available in the same portal.
Root tag | Mapped to index field or 'facet'
---------|---------------------------------
What     | Subject
Where    | Location
Who      | Person
Table 1: Mapping from hierarchical keyword tags to index fields in a web portal
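A minimal sketch of how this mapping could be applied at indexing time follows; the facet names come from Table 1, while the function name and the decision to ignore unmapped root tags are assumptions.

    # Mapping from Table 1: root tag to index field or 'facet'.
    FACETS = {"What": "Subject", "Where": "Location", "Who": "Person"}

    def index_fields(tags):
        """Turn hierarchical keyword tags into (facet, value) pairs."""
        fields = []
        for tag in tags:
            parts = tag.strip("/").split("/")
            facet = FACETS.get(parts[0])
            if facet:  # ignore tags whose root is not mapped (an assumption)
                fields.append((facet, "/".join(parts[1:])))
        return fields

    print(index_fields([
        "/What/Project/Signing_The_School/MessageType/Welcome",
        "/Where/Australia/Queensland/Toowoomba/School_a",
    ]))
    # [('Subject', 'Project/Signing_The_School/MessageType/Welcome'),
    #  ('Location', 'Australia/Queensland/Toowoomba/School_a')]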
Missing links
The main missing link here is to the analytical software. We are certainly not proposing that Digikam is a CAQDAS package, but we are recommending that our researchers encode their data directly into their images in the interests of data portability and flexibility. An alternative would be to create a process that takes the CAQDAS analysis and uses it to insert coding into the images, or to change the CAQDAS software itself, but by starting with generic tools we hope to produce a more broadly useful option before undertaking work that is specific to one software application.
It is worth noting here that with a proprietary software package such as Nvivo, one can make suggestions but it is not possible to rewrite the software to handle metadata embedded in images. Still, we can suggest it to the software vendor, and we will do so by forwarding a copy of this paper.
This study is image-driven, and well suited to the idea of tagging an entire image and inserting a transcript into the caption. Marking up parts of an image, video file or text file presents more complex use cases, where in many instances it will not be possible to store the analysis in the source file. It is out of scope for the small investigation reported here to comment further; as noted in the conclusion, more work is needed.
Before concluding we will look briefly at some of the other options open to researchers on the open web. It would be possible for the ETC researchers to upload their images to a photo sharing service such as Flickr, where the embedded tags would allow a global community to add additional material and analyse it in new ways.
It would even be possible to crowd-source data collection, turning it over to the community and extending the study to other communities, with the obvious side-effect that anonymity would no longer be possible. The schools could even start a dialogue with the community by creating messages that skew the statistics in new ways, or strive to create messages that defy the tagging schemes.
In fact, now that the first paper derived from this work has been published, there is no guarantee that staff from the schools considered in the study will remain unaware of the research. [J&A – I'm no ethnographer (although I did get 100% for a second-year assignment for it about 20 years ago) – can you comment?]
Conclusion
[TODO: add notes about what we can and will do at USQ]
Further research sparked by this paper:
Evaluate free software options and interoperable standards.
Work with the humanities community to pressure vendors to support interoperable standards.
Look for further ways to work with commodity tools like photo editors.
[1] J. Austin and A. Hickey, “Signing the school in neoliberal times: the public pedagogy of being pedagogically public,” The International Journal of Learning, vol. 15, Jun. 2008; http://eprints.usq.edu.au/4240/.
[2] Loughborough University, “CAQDAS (New Media Methods @ Loughborough)”; http://www.lboro.ac.uk/research/mmethods/research/software/caqdas.html.
[3] T. Richards and L. Richards, “The NUDIST qualitative data analysis system,” Qualitative Sociology, vol. 14, Dec. 1991, pp. 307-324; http://dx.doi.org/10.1007/BF00989643.
[4] Wikipedia contributors, “Gratis versus Libre,” Wikipedia, The Free Encyclopedia, Wikimedia Foundation, 2008; http://en.wikipedia.org/w/index.php?title=Gratis_versus_Libre&oldid=224478391.
[5] P. Sefton, “eResearch for Word users?,” 2008.
[6] Common Ground Publishing, “CGPublisher Brochure,” Common Ground Publisher Website, 2005; http://www.cgpublisher.com/ui/about/ui/CGPublisherBrochureFeb05-1.pdf.
[7] Common Ground Publisher, “CGPublisher - Interpersonal Computing,” 2008; http://www.cgpublisher.com/ui/about/interpersonal_computing.html.
[8] B. Cope and M. Kalantzis, “Text-made Text,” E-Learning, vol. 1, 2004.
[9] iText Java PDF generation library; http://www.lowagie.com/iText/.
[10] A. Treloar and D. Groenewegen, “ARROW, DART and ARCHER: A Quiver Full of Research Repository and Related Projects,” Ariadne, 2007; http://www.ariadne.ac.uk/issue51/treloar-groenewegen/.
[11] C. Harboe-Ree, A. Treloar, and M. Sabto, “ARROW: Australian Research Repositories Online to the World,” 2003; http://eprint.monash.edu.au/archive/00000046/.
[12] Wikipedia contributors, “IPTC Information Interchange Model,” Wikipedia, The Free Encyclopedia, Wikimedia Foundation, 2008; http://en.wikipedia.org/w/index.php?title=IPTC_Information_Interchange_Model&oldid=222156958.
[13] Wikipedia contributors, “Exchangeable image file format,” Wikipedia, The Free Encyclopedia, Wikimedia Foundation, 2008; http://en.wikipedia.org/w/index.php?title=Exchangeable_image_file_format&oldid=224131031.
[14] Wikipedia contributors, “Extensible Metadata Platform,” Wikipedia, The Free Encyclopedia, Wikimedia Foundation, 2008; http://en.wikipedia.org/w/index.php?title=Extensible_Metadata_Platform&oldid=224000610.
[15] Wikipedia contributors, “Geotagging,” Wikipedia, The Free Encyclopedia, Wikimedia Foundation, 2008; http://en.wikipedia.org/w/index.php?title=Geotagging&oldid=224090135.
1 For more on the use of new digital technologies in contemporary research activity, see Hickey & Austin (forthcoming 2008b)