Interested in finding more details on a batch of files, for example during your pre-ingest processes or on a set of existing archived files where you are concerned about possible obsolescence?
There is one easy-to-use tool that does exactly that. DROID (Digital Record Object Identification) is an automatic file format identification tool from The National Archives, which uses The National Archives’ PRONOM technical registry service.
– Create a new directory wherever you want to keep DROID – I created a directory called droid-v6 under Program Files on Windows – you will want a new empty directory as DROID does not unzip nicely.
– Download DROID from http://droid.sourceforge.net/ to your new directory and unzip it. The latest version is currently droid-6.01.zip.
– Check the ‘Running DROID.txt’ file for instructions. Currently DROID is started by the provided “droid.bat” file (for Windows) or “droid.sh” file (for Linux/Mac).
– This opens a simple and easy-to-use graphical interface with a pre-loaded empty profile. If you have already saved a collection of files for checking, click the Add button, navigate to the directory where you saved the files and then click Start. Twelve minutes later I had a list including file types and version details for all of the 3,702 pdf files in that directory.
DROID is also available as a component in some repository software and other preservation tools such as JHOVE, but if those are not available to you this is a tool that provides really useful information, is simple to install and easy to use, and has clear help instructions.
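Once DROID has produced its list, a quick tally by format and version makes it easier to spot anything unusual. The following is a minimal sketch, assuming the results have been exported to a CSV file and that the export has columns named FORMAT_NAME and FORMAT_VERSION. Both column names and the function name are assumptions for illustration, so check them against the header row of your own export.

```python
import csv
from collections import Counter


def summarise_formats(csv_path):
    """Count rows per (format name, format version) in a DROID-style CSV export."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Column names are assumptions; adjust them to match your export.
            name = (row.get("FORMAT_NAME") or "").strip() or "(unidentified)"
            version = (row.get("FORMAT_VERSION") or "").strip() or "(no version)"
            counts[(name, version)] += 1
    return counts
```

Calling `summarise_formats("droid-export.csv").most_common()` (the file name is hypothetical) would list the most frequent format and version pairs first.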
Interviews have now been completed. I managed to do eleven interviews, a small number due to resource and time constraints. The findings can therefore only present a snapshot. This is sufficient for the short EPIC project but we very much hope that we can do more of this work in the future to inform our preservation activities. It has been a very interesting exercise.
Findings in short:
No tolerance to change was shown for object coherence, character content, or the readability of the content by machines and humans. For the scientists this also applied to the structure of tables and equations.
Very little tolerance was shown to change in the document structure and colour depths in figures. For the scientists this also applied to captions of figures, tables and equations and the position of equations.
High tolerance to change was given to the appearance of font type and size, and to the behaviour of URLs, jump references, scripts and security mechanisms (no interviewee applied these).
The spreadsheet including the interview outcomes is available on the project webpage.
Findings will be presented in more detail in the final report.
Barbara Bultmann, DSpace@Cambridge
Right, I am now handing the data over to David to be put into the Plato tool. I am sure we will revisit the primary data at a later stage. I am looking forward to reviewing it when we have a suggested preservation action plan. I wish to check whether the anticipated results are in line with the expectations of the interviewees. (I hope I’ll get a chance to do this within the scope of the project anyway.)
I have just uploaded the interview template to the website. It is derived from a review of the significant properties literature, with particular reference to the InSPECT and Planets projects, and is very much aligned with the workflow integrated in the PLATO preservation planning tool. Following the conceptual model of significant properties created in the course of the Planets project, the PLATO tool suggests a division between Object, Technical and Process characteristics and Costs. Whilst David is testing the JHOVE2 and DROID tools to help identify extractable properties, it was my job to discuss the observational properties with authors: which aspects of the documents in the collections they felt needed to be retained in order for the documents to remain understandable and useful.
I decided to create a list of characteristics and essentially use this as the interview template prefaced with some introductory questions. The list follows the intellectual division of levels (appearance, structure, content and behaviour) with a large number of lower level criteria. The selection of criteria on the list results from an evaluation of representative sample files and from comparing our list with other sample plans shared in the PLATO tool. I also had a dummy run of our draft template with a friendly project participant who gave me useful feedback and additions to it.
I was then thinking of how to conduct the interviews and how the outcome could be most efficiently recorded. I decided to bring a number of documents to each interview: a print out of a couple of example documents from the relevant collection which allowed us to flick through, check and remind ourselves what we were actually talking about and a large print out of the list of characteristics I was going to ask them about. (I did this in the format of a mindmap on which I had some positive feedback.) We discussed how important each characteristic was, both separately and relatively, and whether there were any others that we had not taken into account.
Participants were encouraged to quantify their answers (small change acceptable, large change acceptable, no change acceptable at all, and equivalent ranges for other types of questions). I recorded the interviews on a digital audio recorder but decided against doing transcriptions. Instead I created a spreadsheet listing the characteristics and quantified answers. This allows us to analyse the results easily and translate them to the Plato tool without any further intermediary steps. That’s the plan anyway…
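To illustrate why quantified answers are easy to analyse, here is a minimal sketch of tallying them per characteristic. The three-level scale mirrors the ranges mentioned above, but the exact labels, the function names and the sample rows are hypothetical and are not taken from the actual spreadsheet.

```python
from collections import Counter, defaultdict

# Hypothetical answer scale, ordered from least to most tolerant of change.
SCALE = ("no change", "small change", "large change")


def tally_answers(rows):
    """Turn (characteristic, answer) pairs into {characteristic: Counter of answers}."""
    tallies = defaultdict(Counter)
    for characteristic, answer in rows:
        if answer not in SCALE:
            raise ValueError(f"unexpected answer: {answer!r}")
        tallies[characteristic][answer] += 1
    return dict(tallies)


def strictest_answer(counter):
    """Return the least tolerant answer that any interviewee gave, or None."""
    for answer in SCALE:
        if counter[answer]:
            return answer
    return None


# Illustrative rows only; these are not real interview results.
sample = [
    ("font type", "large change"),
    ("font type", "small change"),
    ("character content", "no change"),
]
tallies = tally_answers(sample)
```

Taking the strictest answer per characteristic is just one possible way to summarise the tallies; a majority view or a full distribution may suit the preservation plan better.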
Barbara Bultmann, DSpace@Cambridge
In the process of identifying “designated communities” for the EPIC project we ran through a format review of all textual material in DSpace@Cambridge. Here is what we found:
.doc (175)/.docx (1)
The .txt files are almost exclusively licence files so are of no interest in the context of this project. We then ran a detailed analysis of both the .doc/.docx and .pdf files and reviewed the content. Our aim was to select communities who a) deposited file types that are or might soon be at risk, b) are potentially available for interview, and c) cover a variety of disciplines. The list of communities we came up with was still very long, so we decided to concentrate on one format (.pdf) and thereby make the interview list manageable for this short project. We chose .pdf not only because it is the prevalent textual format in DSpace@Cambridge but also because we had recently run into problems with the format in another context. So we are interested to find out more.
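A format review of this kind boils down to counting files by extension. The following is a minimal sketch, assuming the files have already been exported from the repository into a local directory tree; that export step and the function name are assumptions for illustration and not a description of how the review above was actually run.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Count files under root by lower-cased file extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Extensions are only a first approximation of format, which is one reason a signature-based tool such as DROID is worth running afterwards.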
List of “designated communities” to be interviewed:
Department of Physics
Department of Materials Science and Metallurgy
Faculty of Classics
Judge Business School
Department of Pure Maths and Mathematical Statistics
Department of Modern and Medieval Languages
Department of Archaeology
Department of Plant Sciences
World Oral Literature Project
Department of Engineering
Department of Applied Maths and Theoretical Physics
It was then for me to identify individuals and schedule interviews. I decided to contact the person responsible for the deposits – this in many cases was the author him/herself or in some cases the Head of Research Unit or Department. I also added a couple of supporting staff to the list to get a slightly different perspective. We considered conducting group interviews but decided that the short time of the project would not allow for this. The idea was that having more than one person thinking on complex issues would surely open debate and may result in more detailed answers, particularly if the members had differing roles – maybe an Academic, a Librarian, a Computing Officer etc. Something to consider doing in future.
I was very pleased with the response rate to my interview requests – half the addressees responded positively without me even having to remind them. There must be some interest in preservation planning!
Working through the process of acquiring the files that we will test in Planets.
One problem so far: file names do not need to be unique in DSpace, so I added a ‘file exists’ test when acquiring the 120 Microsoft Word files, and several of them have a file name that is the same as an existing file. This is partly caused by duplication of the files across multiple different records. For now I am saving such a file with a different name, but I need to think of a better way before starting to acquire the 3,500 pdf files – maybe a subdirectory per item?
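The subdirectory-per-item idea could look something like the sketch below. It assumes each file arrives with some item identifier (for example a handle or an internal ID); the function name, the parameters and the directory layout are all hypothetical.

```python
from pathlib import Path


def save_bitstream(root, item_id, file_name, data):
    """Write data to root/<item_id>/<file_name>, refusing to overwrite a file."""
    # Handles often contain "/", which would otherwise create nested directories.
    safe_id = str(item_id).replace("/", "_")
    item_dir = Path(root) / safe_id
    item_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final name component in case the name includes a path.
    target = item_dir / Path(file_name).name
    # Mode "xb" raises FileExistsError instead of silently overwriting.
    with open(target, "xb") as handle:
        handle.write(data)
    return target
```

With this layout two items can each hold a file called thesis.pdf without clashing, while a genuine duplicate within one item still raises an error that can be logged and investigated.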
Also an action for me – we need to start thinking now about the metadata we need to create for each item to describe the preservation actions taken.
Today was scheduled for a test install of the Planets Suite on Windows NT.
I initially tried to use the installer at http://planets-suite.sourceforge.net/download/ but ran into problems. There were error messages during the install, and starting up the Planets server gave a JBoss error:
WARN [org.jboss.web.tomcat.service.JBossWeb] Failed to startConnectors
LifecycleException: service.getName(): “jboss.web”; Protocol handler start failed: java.io.FileNotFoundException:
So I moved on to building the application using Subversion and ANT, following the Linux installation instructions from the Open Planets Foundation wiki.
These worked perfectly, with only a few minor changes to accommodate Windows:
- I had the Java 6 JDK and ANT already installed so I skipped those steps.
- I followed the ‘Adding paths, JAVA_HOME and ANT_HOME’ section by setting
– JAVA_HOME to D:\Program Files\Java\jdk1.6.0_21
– I had ANT in my path already (D:\apache-ant-1.8.1\bin) for a different application so did not set up ANT_HOME
- Downloaded and ran Subversion from http://subversion.apache.org/packages.html – taking a Windows version Win32Svn (32-bit client, server and bindings, MSI and ZIPs; maintained by David Darj)
- Downloaded Planets Server and Planets Suite from Subversion by copying and pasting the OPF instructions. These installed to a directory D:\~ so I renamed that to D:\home after the download had finished.
Then over to the planets-server directory:
- Copied planets-server.properties.template file to planets-server.properties
- Changed the path to the local planets-server to
- Changed the location of the server configuration directory to
- Left address/name of the machine as ‘localhost’ as this is a test on my pc
- Left the port the server will listen on as 8080
- Left the ssl port the server will listen on as 8443
- Changed username/password for user management database access. This was a mistake, as the Derby DB setup did not work properly and I had to change back to the original values
- Copied the email configuration from OPF – as again this was a test implementation
Then over to the Planets-suite directory for similar changes:
- Copied build.properties.template to build.properties
- Changed the path where the installed IF framework is located to
- Changed the location of the IF and service config directory to if_server.conf=D:/home/planets/planets_server/server/default/conf/planets
- Changed the directory that stores the Data Registry configuration to if_server.doms.config.dir=D:/home/planets/planets-server-compiled/server/default/data/planets/dom-config
- Left the rest as the defaults
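The copy-the-template-then-edit steps above can also be scripted, which helps if the install has to be repeated. This is a minimal sketch in Python rather than anything provided by Planets: the function name is hypothetical, and it assumes the template consists of simple key=value lines with # or ! comments.

```python
from pathlib import Path


def write_properties(template_path, output_path, overrides):
    """Copy a .properties template, replacing the values of the given keys."""
    remaining = dict(overrides)
    lines = []
    for line in Path(template_path).read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        is_comment = stripped.startswith(("#", "!"))
        if stripped and not is_comment and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in remaining:
                line = f"{key}={remaining.pop(key)}"
        lines.append(line)
    # Append any keys that the template did not contain.
    lines.extend(f"{key}={value}" for key, value in remaining.items())
    Path(output_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For example, `write_properties("build.properties.template", "build.properties", {"if_server.conf": "D:/home/planets/planets_server/server/default/conf/planets"})` would reproduce one of the changes listed above. Forward slashes are used in the paths, as in the post, because a backslash is an escape character in Java properties files.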
Then ran the ANT builds
In Planets-server …
- ant deploy:planets-server
- ant create:dbs
(The documentation said to expect a number of the sql statements to fail)
- ant deploy:framework
In Planets-suite …
Then started the Planets Server by changing to the bin directory where the compiled version had been installed
cd D:\home\planets\planets-server \bin
and running run.bat
To use Planets, point your browser at http://localhost:8080/ and log in with username user and password user.
So overall an easy process. Now on to exploring the operational aspects.