Thursday 29 November 2018

World Digital Preservation Day 2018


Today is World Digital Preservation Day.  It is the day when the digital preservation community around the world comes together to celebrate the collections that have been preserved, the access that has been maintained, and the work that is being done to preserve our digital legacy.

Organised by the Digital Preservation Coalition and supported by digital preservation networks all over the world, World Digital Preservation Day raises awareness of the strategic, cultural and technological issues which make up the digital preservation challenge. Since the first public website was launched nearly 30 years ago, there has been an explosion of digital content worldwide. This ‘born digital’ content is tomorrow’s cultural heritage, and it’s our job to ensure that we collect and preserve this digital history for future generations.

When I first heard about this day I was excited.  Ever since I attended iPRES (the International Conference on Digital Preservation) when it was held in Melbourne in 2014, I have been fascinated by, passionate about and intrigued with digital preservation.  Unfortunately, that enthusiasm isn't shared by my institution.  Digital preservation costs money, sometimes a lot of money, and that is something that my institution is very loath to part with.  (I should know: I've been campaigning for a new institutional repository system since 2010 with no luck, and that is peanuts compared to a preservation system.)

But, in my own way, I have been steadily pushing the preservation agenda, one step at a time.

Now, many people don't really understand what digital preservation actually is.  Essentially, it is the coordinated and ongoing set of processes and activities that ensure long-term, error-free storage of digital information, with a means for retrieval and interpretation for the entire time span the information is required.  Preservation isn't just digitisation; however, digitisation is part of preservation.

So with that in mind, this is what we have been doing in this space for the last few years.

Institutional repository (IR)
Our IR actually does preservation reasonably well (which is surprising, as not much else works with it).  It produces a PREMIS datastream; PREMIS is the international standard for metadata to support the preservation of digital objects and ensure their long-term usability.  If only we understood how the IR platform actually uses it…  Our IR also does versioning very well.  While not strictly 'digital preservation', it is a form of preservation, as each version is maintained in the system (metadata and documents). However, this is where our investment in digital preservation ends.  Our documents are stored at best as standard PDFs, at worst as Word documents or other proprietary file formats.  And we do not actively maintain the files in the system, but rather just 'believe' that they will work when we next try to download one.
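In the absence of a preservation system, even a small scheduled script could replace that blind faith with actual fixity checking. Below is a minimal sketch (my own illustration, not anything our IR does; the manifest file name and layout are assumptions) that verifies stored files against recorded SHA-256 checksums:

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_fixity(manifest_file: str) -> None:
    """Compare current checksums against a stored manifest.

    The manifest is assumed to be JSON of the form
    {"relative/path/to/file.pdf": "<sha256 hex digest>", ...}.
    """
    manifest = json.loads(Path(manifest_file).read_text())
    for relative_path, recorded in manifest.items():
        current = sha256_of(Path(relative_path))
        if current != recorded:
            print(f"FIXITY FAILURE: {relative_path}")

if __name__ == "__main__":
    verify_fixity("fixity_manifest.json")  # hypothetical manifest file
```

Run regularly (and with the results acted upon), something this simple is already a step up from 'believing' the files are fine.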

Digital collections
Our institution has a number of digital research collections, and the degree of "preservation-ness" of each one varies.  One has PDF/A as a standard for any scanned documents, another has straight PDFs, and the third (a collection of images) has RAW, TIFF and JPG files.  However, like the IR, the files are not actively maintained for fixity and access.  This is a new focus area for us (and for the system that we use), and I'm hoping it will develop further in this space in the future.

Research data
Unfortunately, beyond backup, no digital preservation activities are performed on our research data.  Like much of our research infrastructure, we are working with sub-standard, ad hoc systems.  Once we can better manage the data from an administrative perspective, hopefully we will be able to better manage its digital preservation.


Digital preservation is something that I get very excited about. However, I'm not naive: I realise that it can be, and probably is, a very dry subject to many people. So, just to make it a bit more interesting, here are some websites that I think are pretty cool.

The Museum of Obsolete Media has over 500 current and obsolete physical media formats covering audio, video, film and data storage.

And the ‘Bit List’ of Digitally Endangered Species, a crowd-sourced list of the digital materials the community thinks are most at risk.

But perhaps the coolest thing about World Digital Preservation Day is that it gives us a chance to have cake!

Happy World Digital Preservation Day everyone!

Saturday 24 November 2018

Just the Pure truth...

On Friday we had another demonstration from Elsevier of their research management system, Pure.  We had a demonstration/sales pitch a few years ago, but at the time things were not as desperate as they are now.  The research FTE has grown by over 100% in the last eight years, with journal articles alone growing by a whopping 230% in the last five years (as submitted for ERA evaluation).  So, in spite of the Library having chosen a preferred new institutional repository, the Research Office decided that they knew better and organised a new demonstration of Pure.



To say that it was interesting would be an understatement.  It was an informative demonstration that showed that there are so many holes in Pure that I’m surprised it is staying afloat!

But, let us look at the reasons why our institution is interested.  The Research Office currently uses Research Master for Grants, Ethics and HDR management.  They used to use it for Publications, but have recently turned this module off in favour of ‘linking’ their reporting to our very flaky and unstable Access database, which shadows our institutional repository (a whole other story).  The HDR management is being migrated to PeopleSoft as we speak, which leaves just the Grants and Ethics.  In an ideal world the institution would source a new system that could cope with both the Research Office’s needs and the Library’s (which include research outputs, data management and digital collections), but it seems that just such a system has yet to be made.  The Library is keen on signing up for ExLibris Esploro, which initially would cater only for publications; however, research data management and possibly grants management are on its roadmap.  The Research Office is “lukewarm” about Esploro (I know not why), so when the invitation came for another demonstration of Pure, I was ready and waiting to see what had changed in the last few years.
I have to say, not much.  Below are some random thoughts from the demonstration (which was delivered as a webinar).

  • Many institutions, including University of St Andrews (UK), which was featured in the Elsevier PPT, use Pure but also still maintain a separate institutional repository.  When I asked the Elsevier presenter why this would be, he responded that there may be a number of reasons, including the need for a system that can handle collections (Pure doesn’t handle collections at all).  As a Library we have digital archival and research collections (for example, the K'gari (Fraser Island) Research Archive and the USC Art Gallery Ephemera Collection) that currently use the ExLibris Alma/Primo VE platforms.  Although there have been, and still are, some teething problems, on the whole this is a big improvement on our previous system, Canto Cumulus.  So it is likely that we would continue using Alma/Primo for these types of collections.  However, there are other research collections housed in our institutional repository that could not be catered for in Pure, such as publications relating to a particular project or a particular research group (especially if it were an informal group), theses, conferences that USC hosts, etc.  
  • Reporting and analytics are always a problem with any system, and often the deal-breaker if the system doesn't deliver.  During the recent CAUL Research Repository Days in Melbourne it was mentioned that Pure’s reporting was “diabolical” and that some coding experience was required.  Pure doesn't give direct access to the database, but provides all reporting through APIs; the API structure is the same as the Scopus API (see the sketch after this list for a flavour of what that means in practice).  Reporting on the back end is via APIs, whereas reporting from the front end is via the dashboard.  Elsevier is currently enhancing its reporting module, including building a ‘write’ API.  Interestingly, my advance question on output analytics (page views, downloads) was ignored, and I didn’t realise until afterwards that it hadn’t been answered.
  • In our previous demonstration some years ago, Pure had no Ethics module.  Now they have a basic Ethics record, with some institutions such as Monash University using Infonetica as their main Ethics system.  I do not know enough about Ethics workflows to know what we would need, however it seems that Pure will not cater for the Ethics requirements of the institution.  

  • When I asked at the end about compliance with the ARC/NHMRC OA mandate, I expected a system such as Pure to be up to code.  However, I was surprised to hear that they are not compliant.  Elsevier is meeting with the ARC in the near future to discuss the requirements, so who knows when compliance will make it into production.  That being said, our current institutional repository is only about half compliant.  But shhhhh! Don’t tell anyone.  
  • Preservation is becoming increasingly important for digital data, and research information is no different.  It is something that I am very passionate about, while acknowledging that I am a novice in the field.  When asked about Pure’s preservation strategy, the Elsevier presenter mentioned that a history of metadata records is kept in the background (no clue as to whether this is accessible to the administrator or only to Elsevier staff).  No versioning of documents is kept, as the system is not designed for this.  However, I believe that Pure can plug into third-party proprietary preservation systems, although this wasn’t confirmed by the presenter.
  • During the demonstration several years ago it was mentioned that any metadata harvested into Pure from Scopus was unable to be edited.  Scopus is not the most perfect of metadata aggregators and there are often mistakes with the metadata.  So to hear that the system would have a subset of records that are unable to be edited was alarming.  Happy to say that this has now changed.  Pure now lets you edit Scopus records, and is even considering allowing users to edit records for errors which would then feed back to the Scopus database!
  • Editing records is a bugbear in our current institutional repository, with many of our older records unable to be bulk edited.  The Elsevier presenter said that all records could be edited in bulk.  The documentation, however, states that not all fields can be edited.  This was another of my advance questions, but like the analytics question I forgot to ask about it during the demonstration.
  • One of the advantages of Esploro is that research data management, in particular dynamic data management plans, is on the roadmap to be developed.  Pure is not going to have research data management capabilities, but will rely on Mendeley Data as the data management tool.  I haven’t heard anything about Mendeley Data, so will need to look into what its capabilities are. 
  • The Grants management in Pure seems to be a fairly superficial module, at least compared to the richness of the Research Master data.  However, like Ethics, I do not know enough about Grants to know if this would be a suitable system.  I do know that Esploro will interoperate with Pure, so if the Research Office chooses to use Pure for Grants and we end up using Esploro for our institutional repository, they will at least talk to each other.
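As a footnote to the reporting discussion above: to give a flavour of what API-only reporting means in practice, here is a minimal sketch of pulling a page of research outputs from a Pure-style REST API. The host, endpoint path, header name and response shape are assumptions drawn from Pure's public API documentation, not anything shown in the demonstration; check your own instance's docs.

```python
import requests

# Hypothetical values: substitute your institution's Pure host and API key.
PURE_HOST = "https://pure.example.edu"
API_KEY = "your-api-key"

def fetch_research_outputs(size: int = 25, offset: int = 0) -> dict:
    """Fetch one page of research outputs from a Pure-style REST API.

    Endpoint path and headers are assumptions; consult your Pure documentation.
    """
    response = requests.get(
        f"{PURE_HOST}/ws/api/research-outputs",
        headers={"api-key": API_KEY, "Accept": "application/json"},
        params={"size": size, "offset": offset},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    page = fetch_research_outputs()
    for item in page.get("items", []):
        # Titles are assumed to be nested objects; guard against missing fields.
        print(item.get("title", {}).get("value", "<no title>"))
```

The point is simply that every report starts with code like this rather than a query against the database, which is exactly why "some coding experience was required".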
On the whole, Pure is a system that could be used as long as you make allowances for its limitations - in reporting, the institutional repository, collections, Ethics, Grants and research data.  No system is perfect; however, some systems are more perfect than others.  And I fear that Pure isn’t one of them.


Thursday 1 November 2018

CAUL Research Repository Days 2018

The 2018 CAUL Research Repository Days were held in Melbourne over 29-30 October. Although there was much discussion over many different topics, the program was very much focused on interoperability between systems, a trend that I have observed the IR community heading towards. With a well-running platform, repository work is less about the 'publication' and more about how systems interact with each other.

The below is a summary of some of the themes that were of particular interest to my institution and myself. 
 

CAUL Projects

Review of Australian Repository Infrastructure Project

Much of Day 1 was spent in discussion of FAIR. Drafted in 2014 but published in 2016, the FAIR principles (Findable, Accessible, Interoperable and Reusable) are a set of 15 guiding principles designed to determine the level of FAIRness of an output or system. In response, CAUL proposed a project in 2017 to determine how improvements to repository infrastructure could be made across the sector to increase the FAIRness of Australian-funded research outputs. The final report has just been released.

The project comprised seven working groups, designed to examine the current repository infrastructure, international repository infrastructure developments, repository user stories, the ideal state for Australian repository infrastructure, and next generation repository tools, and to make recommendations for a possible "Research Australia" collection of research outputs. The findings of the first six groups are included in the report, while the seventh, the "Research Australia" recommendations, is due at the end of 2018.

Each working group provided a report of their findings. Most were not surprising and were generally what we have known to be the case for some time. In summary (and in no particular order), they include:
  • Although nine institutions had new generation repository software, many of the others had ageing infrastructure that had perhaps not been funded since the ASHER funding in 2007, with VITAL particularly mentioned as dropping in numbers 
  • Ageing software was identified as a weakness of repository infrastructure, as was the lack of automation and identifiers 
  • Research outputs were the most common output in IRs, followed by theses and research data. Other output types included archival collections, journals, images and course materials 
  • Institution numbers were almost equal in terms of having an OA policy, a statement or partial policy, or no OA policy at all 
  • Only 5 institutions supported research activity identifiers (although they didn’t specify RAiDs in particular) 
  • 13 institutions had a digital preservation strategy for the IR content, with a further 3 developing a strategy 
  • Most successful initiatives have stable secure funding. 
  • Recommendation that CAUL seek consortia membership of COAR 
  • List of general repository requirements. 

The seventh group is looking at the feasibility of a "Research Australia" portal as a single entry point to a collection of all Australian research outputs. This is similar to the RUN proposal of some years ago. Views on this were extremely mixed. Responses included that it duplicates what we already have with Google Scholar and TROVE, whether it would be OA or metadata only, the quality of metadata, and questions over unique institutional requirements. Three possibilities have been proposed - upgrade TROVE to provide all necessary reporting needs, develop a new portal harvesting repositories (similar to the OpenAIRE model), or develop shared infrastructure. 
 

Collecting and Reporting of Article Processing Charges (APCs)

Another CAUL project currently underway is the APC project, determining the cost of article processing charges for institutions. Several options are proposed. Less preferred options include creating a fund code in the institution's finance system, or querying the finance system using a selection of keywords. Another less preferred option is obtaining reports from publishers, or making them provide this information as part of the subscription agreement. What is likely to be proposed is a very manual method: extract a dataset from Web of Science, Scopus and Dimensions, either by institution or nationally; run it against the Unpaywall API to find which are OA publications; dedupe on DOI; and then, using the publisher list price for APCs, determine the cost of the APC payment based on the corresponding author's institution. A couple of institutions have done this calculation internally, with varying results. My own use of the Unpaywall API has shown it to be unreliable in terms of finding OA outputs, as false positives can be returned; however, it seems to be the most promising tool to date in this respect. 
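For what it's worth, the DOI-deduplication and Unpaywall lookup steps of that proposed workflow are simple to sketch. The following is my own illustration, not the project's actual code; the email address is a placeholder, as Unpaywall requires one as a query parameter:

```python
import requests

UNPAYWALL_EMAIL = "you@example.edu"  # placeholder; Unpaywall requires an email

def is_open_access(doi: str) -> bool:
    """Query the Unpaywall API for a DOI and return its is_oa flag.

    DOIs unknown to Unpaywall (HTTP 404) are treated as not OA.
    """
    response = requests.get(
        f"https://api.unpaywall.org/v2/{doi}",
        params={"email": UNPAYWALL_EMAIL},
        timeout=30,
    )
    if response.status_code == 404:
        return False
    response.raise_for_status()
    return bool(response.json().get("is_oa"))

def oa_outputs(dois):
    """Deduplicate DOIs (case-insensitively) and keep only the OA ones."""
    unique = {doi.strip().lower() for doi in dois if doi}
    return [doi for doi in sorted(unique) if is_open_access(doi)]

if __name__ == "__main__":
    sample = ["10.1038/nature12373", "10.1038/NATURE12373"]  # duplicates differing only by case
    print(oa_outputs(sample))
```

The hard part, of course, is everything either side of this: assembling the source dataset and mapping list-price APCs to corresponding authors.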
 

Retaining Rights to Research Publications

A survey of Australian university IP policies has been undertaken to identify potential barriers to the implementation of a national licence in Australia, similar to the UK-SCL licence. The key element of the UK-SCL licence is to retain the right to make the accepted manuscript of scholarly articles available publicly for non-commercial use (CC BY NC 4.0) from the moment of first publication. An embargo can be requested (by either the author or the publisher) for up to 12 months. However only 13 Australian universities have an IP policy that would be supportive of this licence. Recommended as the next step by CAUL is to approach Universities Australia for consideration and the development of guidelines for alignment of IP policies. 
 

Statement on Open Scholarship

A final CAUL project is the Statement on Open Scholarship which is a call to action around advocacy, training, publishing, infrastructure, content acquisition and education resources. The review period ends at the end of October. 
 

FAIR Data


Natasha Simons, ARDC, reported on an American Geophysical Union project designed to enable FAIR data. The project objectives were to look at FAIR-aligned repositories and FAIR-aligned publishers. There is a push for repositories to be the home for data rather than the supplementary section of journals. A commitment statement has been produced with a set of criteria that repositories must meet in order to enable FAIR data. (USC can meet about half of the requirements with the current infrastructure and policies).

In terms of Australian repositories, the AGU project may influence subsequent projects in other research disciplines. As publishers are moving away from data in supplementary sections of journals to data in (largely domain) repositories, trusted repositories (the Core Trust Seal) are becoming increasingly important.

Ginny Barbour, AOASG, proposed a new acronym, "PID+L" (pronounced "piddle"), as the essential minimum of metadata required for research outputs to be FAIR:
  • PID 
    • ORCID 
    • DOI for all outputs 
    • PURL for grants 
  • Licence (machine readable) 
Note that we are unable to do this with our current infrastructure. 
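To make the baseline concrete, a minimal PID+L-compliant record might look something like the sketch below. This is illustrative only: every identifier is a made-up or sample value (the ORCID shown is ORCID's well-known sample record).

```python
# Illustrative only: every identifier below is a made-up or sample value.
pid_l_record = {
    "title": "Example research output",
    "authors": [
        {"name": "A. Researcher", "orcid": "https://orcid.org/0000-0002-1825-0097"}
    ],
    "doi": "https://doi.org/10.1234/example.5678",        # PID: DOI for the output
    "grant": "https://purl.org/example-funder/grant/42",  # PID: PURL for the grant
    "licence": "https://creativecommons.org/licenses/by/4.0/",  # machine-readable licence
}
```

A handful of fields, yet our current infrastructure cannot reliably capture or expose all of them.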

ORCID


Simon Huggard, Chair of the ORCID Advisory Group, provided a snapshot of the ORCID Consortium in Australia. There are 41 organisations in the consortium, with 32 integrations completed (by 29 consortium members). The most popular approach is a custom integration, followed by Symplectic, Pure, Converis, IRMA, Scholar One and ViVo. Seven institutions have done full ORCID authentication integration, so that researchers can sign into ORCID using their institutional credentials. Currently there are 90K Australian researchers registered with an ORCID, up from 30K at the beginning of 2016.

The ORCID Consortium has developed a Vision 2020, which aims to have all active researchers in Australia with an ORCID, all using their ORCID throughout the research lifecycle. The ARC and NHMRC will integrate ORCID into their grant management systems (which they have now done, and which will be live in the next couple of weeks), and, where possible, government agencies will draw upon ORCID data for research performance reporting and assessment.

There are challenges in integrating ORCID institutionally, the most common being private profiles (early profiles were set to private by default) and synchronisation issues, particularly duplicates where metadata may differ slightly between source systems. Another challenge is getting ORCIDs displayed in IRs. When asked about this, the ARC replied that although this is a requirement of their OA mandate, at present it is not a problem, although it will be in the future.
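For context, pulling a researcher's works out of the ORCID public API for display in an IR is the easy part; it is the synchronisation and deduplication described above that hurts. A minimal sketch of the retrieval step (the ORCID iD shown is ORCID's published example record for the fictitious researcher Josiah Carberry):

```python
import requests

def fetch_orcid_work_titles(orcid_id: str) -> list:
    """Fetch work titles for a researcher from the ORCID public API (v3.0)."""
    response = requests.get(
        f"https://pub.orcid.org/v3.0/{orcid_id}/works",
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    titles = []
    # Works are grouped by ORCID when it believes records describe the same output.
    for group in response.json().get("group", []):
        for summary in group.get("work-summary", []):
            title = summary.get("title", {}).get("title", {}).get("value")
            if title:
                titles.append(title)
    return titles

if __name__ == "__main__":
    # ORCID's published sample record (Josiah Carberry, a fictitious researcher).
    print(fetch_orcid_work_titles("0000-0002-1825-0097"))
```

Note that ORCID already groups works it believes are duplicates; the residual duplicates mentioned above arise when source metadata differs just enough to defeat that grouping.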

Digital preservation


Jaye Weatherburn, University of Melbourne, gave a keynote presentation on digital preservation and the role that libraries, in particular IRs, need to play in this. Digital preservation is a series of managed activities necessary to ensure continued access to digital materials for as long as necessary. There are several reasons for looking at digital preservation - decay of storage media, rapidly advancing technology leading to obsolescence, fragility of digital materials, and protection against corruption and accidental deletion. A digital preservation strategy can be used to monitor these risks. Long term preservation however is not a 'set and forget'. It is an iterative process to ensure the life of a document is maintained. Without digital preservation there is no access to materials in the long term.

It should be noted that our IR doesn’t ‘do’ digital preservation beyond saving PDF files of outputs where available, along with metadata. The FIA collection does digital preservation slightly better, in that the PDF/A standard is used for master representations. The Herbarium perhaps does it best, with RAW, TIFF and JPG files being saved for each image. However, without a digital preservation system such as Rosetta, we are not so much preserving our digital data as just backing it up to protect against deletion.

Closely aligned with this theme of preservation is that of trustworthiness of a repository (which also includes the organisation). There are two frameworks that are commonly used for examining the trustworthiness of repositories - the Core Trust Seal, and the Audit and Certification of Trustworthy Digital Repositories based on ISO16363. Both can be self-assessed and provide a good means of documenting gaps, although the Core Trust Seal is less intensive on resourcing and time. This is something that I have been keen to do for USC since I first heard about it at the iPRES conference in 2014 and is something I will complete once a decision is made regarding a future system.

Below is a word-cloud of what attendees thought digital preservation meant to them:
 

Other interesting things: 

Idea of incentivising scholarly communication via cryptocurrency.

Chris Berg, RMIT, opened with a keynote on blockchains as a tool to govern the creation of knowledge. Blockchains are economic infrastructure on which new forms of social organisation can be built. Chris states that academic publishing is a subset of a general problem that has afflicted publishing and the knowledge economy since the invention of the internet. The RMIT Blockchain Innovation Hub project had the idea of incentivising scholarly communication via cryptocurrency - a token to pay and reward for peer review, sharing citations, reading, etc. In terms of economic modelling, journal publishing can be viewed as a 'club'. The aim of the project was to bring transparency to the peer review process, provide digital copyright authentication and verification, and to provide incentives and rewards for the different aspects of the scholarly communication lifecycle. Enter 'JournalCoin'… Subscriptions, article processing fees and peer reviewers could be paid by JournalCoin, and rewards for such things as fast peer reviews, formatting, royalties, rankings and citations paid via JournalCoin. The journal is then the platform upon which the incentives are paid. 

IRUS-UK pilot in Australia

CAVAL is currently running a project on implementing IRUS-UK in Australia. IRUS (Institutional Repository Usage Statistics) started in the UK in 2012 and sought to provide a standards-based service with auditable usage data. The aim was to reduce duplication of effort by IR managers and present a uniform set of usage data regardless of the IR platform. IRUS data is COUNTER-compliant. IRUS-UK now does this for about 140 IRs in the UK. A pilot has been underway in Australia involving University of Melbourne, Victoria University, University of Queensland, University of Sydney and Monash University to evaluate the usefulness of IRUS in Australia. Several of these institutions reported on their experience, which was largely positive. One advantage of the IRUS statistics is that they exclude 'false positive' metrics, resulting in slightly lower statistics than the native IR ones. CAVAL reported that if IRUS goes ahead, maximum benefit will be realised if the majority of Australian universities participate, as individual universities will then be able to benchmark against each other. 
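To illustrate one kind of 'false positive' that COUNTER-compliant processing strips out, here is my own minimal sketch (not IRUS code) of the COUNTER double-click rule, under which repeat clicks on the same item by the same user within 30 seconds are counted only once:

```python
from collections import defaultdict

DOUBLE_CLICK_WINDOW = 30  # seconds, per the COUNTER double-click rule

def count_downloads(events):
    """Count downloads per item, collapsing repeat clicks by the same user
    on the same item within 30 seconds into a single count.

    Each event is a (user_id, item_id, unix_timestamp) tuple.
    """
    counts = defaultdict(int)
    last_click = {}  # (user_id, item_id) -> timestamp of the last click
    for user_id, item_id, timestamp in sorted(events, key=lambda e: e[2]):
        key = (user_id, item_id)
        if key not in last_click or timestamp - last_click[key] > DOUBLE_CLICK_WINDOW:
            counts[item_id] += 1
        last_click[key] = timestamp
    return dict(counts)

if __name__ == "__main__":
    events = [
        ("u1", "thesis-42", 0.0),
        ("u1", "thesis-42", 10.0),   # within 30s of the first click: not counted
        ("u1", "thesis-42", 100.0),  # outside the window: counted
        ("u2", "thesis-42", 5.0),    # different user: counted
    ]
    print(count_downloads(events))  # {'thesis-42': 3}
```

Filtering like this (plus robot exclusion) is precisely why COUNTER-processed figures come out slightly lower than native IR statistics.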

Social Media campaigns

Susan Boulton, GU, provided an outline of a pilot their Library ran to promote its IR through social media. By tying posts to national/international events (such as World Malaria Day, Sustainability Week, and Dementia Month), blog posts and social media mentions were written showcasing the research in their IR. To prepare, time was spent planning, sourcing open access content, identifying champion event owners, and preparing the social media material. These small social media events provided a significant jump in IR traffic and downloads. Another benefit was the improved relationship between researchers and the Library, as researchers can see another value-added service.