Summary and Closing

As the year changed to 2015, the Open Source Archive (OSA) project was completed. However, this doesn’t mean the end of OSA as a platform and software. This post is a high-level overview of what we achieved, what we released and what is coming next.

OSA was a development project first and foremost. Many of the results are packed into the OSA software, but we produced a few publications as well. Like all software, OSA is never really complete. Now that the initial roadmap is done, new features can be added and existing ones improved. There is always some tweaking, additional user interfaces and the like to be developed. We are proud to declare OSA a pilot-ready platform. There are surely a couple of bugs and missing features, but that’s the nature of software. The final release of the OSA project is intended to be developed further into production-ready software or adopted as a platform for any digital archive or repository project.

Features

  • Content agnostic digital archive and repository platform
  • Premade models for common objects (document, audio, moving image, picture etc.)
  • Ingest, management and distribution of digital contents (files and/or metadata)
  • Full-text and natural search
  • Flexible and granular access management
  • Linked data and ontology support
  • SaaS and multitenancy support
  • Completely customizable user interface, content models and preservation policies
  • And lots more

Publications

In addition, six bachelor’s theses were completed during the project.

The software developed during the project is available as open source on GitHub. There is also compact documentation and example configurations to get you started. We also provide a working demo for fast evaluation.

GitHub repository: https://github.com/mikkeliamk
Demo site: https://osa.mamk.fi

As the project is now completed, this blog will most likely not be updated anymore. We could use it to communicate future developments, but we haven’t decided on the channels yet. The GitHub repository and the demo site will be kept up to date with the latest releases, documentation and contact information.

Thank you for following these posts. If you wish to stay in touch or ask anything, just drop an email to osa@mamk.fi or mikko.lampi@mamk.fi.

Posted in Archive system, Digital archiving, Project Information, Publications, Software development

Towards open and sustainable information

We are proud to announce the upcoming seminar Towards Open and Sustainable Information, held on 19-20 November 2014 in Mikkeli. The full program and registration are available here.

It’s a free two-day seminar on topics such as:

  • Finnish national data exchange layer and Estonian X-Road solution
  • Digital preservation and archiving
  • Digital information management and standards
  • Open source and sustainable development
  • My data and human-centric open data

We have an excellent list of high-profile speakers from international and national organizations such as the Ministry of Finance, the Estonian Ministry of Economic Affairs and Communications, the Ministry of Education and Culture, the University of Eastern Finland, the University of Hull, Plymouth University, Open Knowledge Finland, the National Archives of Finland, the National Archives of Sweden and others.

The seminar will be bilingual: the first day will be in English and cover international topics. The second day will be in Finnish and focus more on national-level topics. On both seminar days there will be an exhibition by partners, private companies, organizations and projects operating in the field of the seminar topics. In the evening of the first day we will host a dinner for our speakers and guests. The details will be announced later, but there will be food and networking opportunities.

The seminar is hosted by the Open Source Archive (OSA) and SOTU-sähkö projects. As said earlier, the seminar will be free of charge, and we will provide you with coffee and a soup lunch. The evening event will be paid for by the participants themselves. The project publications will be available during the seminar. The publications provide detailed articles by the speakers and on the project topics.

As a last announcement, I would like to note that we will do our best to update the blog more often. As is often the case with projects, the schedule tightens towards the end. We have lots of development going on, as well as publications and events.

If you haven’t already checked out the seminar page, please do so now: http://www.mamk.fi/sotu-osa

Posted in Data management, Digital archiving, Publications, Seminar

Workflow engine under way

We started integrating a workflow engine into the OSA application last week. The workflow engine is being developed as a bachelor’s thesis by Heikki Kurhinen. The user interfaces for batch ingest were already done for the beta version, but the functionality behind them has been missing.
We are going to develop several micro-services for OSA-specific workflows. These workflows will generate metadata automatically before documents are ingested into our archiving system. Metadata generated during workflow execution is stored temporarily in a MongoDB database. The aim is to bring automation to the ingest process: users can ingest several files at the same time, and all of the content will be handled in the same way. Regardless of the ingest style (manual, or batch ingest using the workflow engine), the last step of our ingest process is still XML schema validation, which is already available in the current version. We have defined schemas for all data types to guarantee the minimum requirements of the Capture datastream. The schemas are stored in the content models in the Fedora Commons repository.
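
To make the batch ingest flow above more concrete, here is a minimal sketch in Python. It is not the OSA workflow engine: the micro-service functions, the field names and the in-memory dict standing in for the temporary MongoDB store are all hypothetical, and the last step is a simple required-field check standing in for the real XML schema validation.

```python
import hashlib
from pathlib import PurePath


# Hypothetical micro-services: each takes a file record and returns
# the metadata it generated for that file.
def detect_format(record):
    return {"format": PurePath(record["name"]).suffix.lstrip(".").lower()}


def compute_checksum(record):
    return {"sha256": hashlib.sha256(record["content"]).hexdigest()}


WORKFLOW = [detect_format, compute_checksum]

# Stand-in for schema validation; the field names are assumptions.
REQUIRED_FIELDS = {"title", "format", "sha256"}


def run_batch(records, common_metadata):
    """Run every file in the batch through the same workflow steps."""
    temp_store = {}  # stand-in for the temporary metadata store
    for record in records:
        metadata = dict(common_metadata)
        for service in WORKFLOW:
            metadata.update(service(record))
        temp_store[record["name"]] = metadata

    # Final step: only records that pass validation are accepted for ingest.
    accepted, rejected = [], []
    for name, metadata in temp_store.items():
        missing = REQUIRED_FIELDS - metadata.keys()
        (rejected if missing else accepted).append(name)
    return temp_store, accepted, rejected
```

Because every record passes through the same list of steps, a manually ingested file and a batch of a thousand files end up with metadata of the same shape before validation.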

While developing new features for the OSA application, we are also collecting user feedback on the current version running at https://osa.mamk.fi.
Of course we test the OSA application ourselves, but we cannot cover real-life use cases by mainly testing newly created functionality. Feedback from project partners is essential for keeping the development roadmap and users’ real requirements in balance.


Beta released

The Open Source Archive beta was published yesterday. You can visit it at https://osa.mamk.fi/. The default interface is used to search and inspect the public content (which is currently very limited). Once logged in, you can access the full features for information managers, researchers and archive administrators. If you would like to have a test account, just drop me an email or leave your details in a comment. In the future, we will also open public registration for test accounts. The final software will be released as open source and also offered as SaaS if you prefer a turnkey solution.

So far, development has focused on the core features (access rights, archive management, ingest, description, searching and indexing) and, of course, the user interface. Next on the roadmap we have multiple pilot cases in which we will develop more advanced features and solutions to specific problems. Here are a few highlights: batch and automated ingests, distributed workflows, and discovering and visualizing the metadata and other contents for end users (both researchers and non-researchers).

The project personnel and the control group were both satisfied. There has been lots of work to get this far. Still, the software is very much in development and we will keep on making it better, more pleasant and loaded with useful features. We would be grateful for all the feedback and critical comments. And we take the feedback and people very seriously. Weekly updates are released to fix any inconveniences and bugs there might be in the first beta. We don’t plan to release perfect software by designing it ourselves. That would be impossible. We learn and iterate, so each release will be better.

Mikko

Posted in Archive system, Software development

Less than a week to beta

It’s almost time to launch the beta. It will go live next Tuesday. It’s a milestone, but since we do near-continuous integration and can publish minor versions multiple times a week, it is not that drastic a change. The difference from the previous alpha is huge, though. The beta is most likely going to evolve a lot during its first weeks. I will write more about the launch next week. At that time, we will also start planning the first pilot cases and begin implementing them as soon as possible.

There has been a major milestone even before the beta launch: we got our local DAITSS installation working. DAITSS, developed by the Florida Center for Library Automation (FCLA), is our choice for the dark archive. Finding the best components and compatible software for DAITSS was quite a tricky task, but the system should be very stable once installed. Next we will ingest data into it and try it in action. If all goes well, we can mirror and distribute the system to keep the data even safer.

The latest developments include unit testing and improvements to the access rights system and the user interface. Building a solid interface has been really time-consuming, even with the feedback from our test users. Many assumptions made by developers turn out to be not quite right, and fine-tuning each detail takes time. The key is to identify and model the use cases and processes well. Luckily, some great theses on designing the user experience for digital archives were completed last year, and one more is being completed this spring. If I could change one thing, I would start the interface design even earlier.

I will keep this posting short. More time for beta development, less time to write about it.

Mikko

Posted in Daitss, Software development

Data management and upcoming conferences

First, we would like to announce that we will be participating in the Archiving 2014 conference in Berlin in May. The OSA project has two papers: “Flexible Data Model for Linked Objects in Digital Archives” by Mikko Lampi (MAMK) and Olli Alm (ELKA – Central Archives of Finnish Business Records), and “Micro-services Based Distributable Workflow for Digital Archives” by Heikki Kurhinen (MAMK/Otavan Opisto) and Mikko Lampi (MAMK). The first paper is about the data model designed during the Capture project. The model is implemented in the OSA software and will be developed further until the end of the project. The main technologies behind the model are the Fedora object model and RDF. The workflow paper is about the software developed by Heikki as his bachelor’s thesis. It is designed to be very simple and easy to integrate with any software.

Here is the latest development news. We started implementing mass modification features for managing the object metadata before and after the ingest. Mass update is very useful for describing the batches of files before ingesting them. Adding common metadata about the owner or the origin can help the ingest and management processes. But we found that it is not that simple to modify descriptive metadata after the initial ingest. Therefore, in the upcoming beta version, the mass updates are available only for files in the workspace before ingesting. We will continue to develop the archive mass updating during the spring.
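
As an illustration of the pre-ingest mass update described above, here is a minimal sketch in Python. The function, the dict-based workspace and the field names are hypothetical and not taken from the OSA code. It applies common metadata to a selection of workspace files and, by default, leaves values that have already been described untouched.

```python
def mass_update(workspace_files, selected_ids, common_metadata, overwrite=False):
    """Apply common metadata to selected workspace files.

    workspace_files maps a file id to its metadata dict (hypothetical
    structure). Returns the ids of the files that were actually changed.
    """
    changed = []
    for file_id in selected_ids:
        metadata = workspace_files[file_id]
        before = dict(metadata)
        for field, value in common_metadata.items():
            # Existing values are kept unless overwrite is requested.
            if overwrite or field not in metadata:
                metadata[field] = value
        if metadata != before:
            changed.append(file_id)
    return changed
```

For example, calling `mass_update(files, ["a", "b"], {"owner": "ELKA"})` would add an owner to every selected file that does not have one yet.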

Many of the Fedora content models have been refactored to contain only the absolute minimum data required to understand and manage the objects. Information about forms, organization-specific mappings and the like was removed from the archive because these are only views or interpretations of the data and not the data definition itself. This will make the design more consistent and give organizations and users much better customization features.

After the last posting, we found out that the current GSearch solution didn’t support the latest Solr, which we need for its Finnish language support. This has been resolved with a new release of GSearch. Because our Solr and Fedora are installed on separate servers, we cannot use GSearch’s reindex functionality. I did some initial testing with SSHFS and NFS for connecting the servers, but this approach is not very sustainable, and network or server errors can cause the index to fall out of sync. We will develop a module to keep track of the sync status and perform reindexing as needed.
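
A sync tracking module like the one planned above could start from a comparison such as this minimal Python sketch. It assumes, hypothetically, that both the repository and the index can report a last-modified timestamp per object id; how those values are fetched from Fedora and Solr is left out.

```python
def find_stale_objects(repository_state, index_state):
    """Compare last-modified values per object id.

    Both arguments map an object id to a last-modified timestamp
    (hypothetical inputs). Returns (to_reindex, to_delete).
    """
    # Missing from the index, or indexed from a different version.
    to_reindex = sorted(
        pid
        for pid, modified in repository_state.items()
        if index_state.get(pid) != modified
    )
    # Still in the index but no longer in the repository.
    to_delete = sorted(pid for pid in index_state if pid not in repository_state)
    return to_reindex, to_delete
```

Running the comparison periodically would let the index recover from missed messages after a network or server error without a full reindex.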

Mikko

Posted in Data management, Fedora Commons, Software development

Development news: digital workspace, access rights and other improvements

It’s time for a brief weekly recap. This week the focus was on the user interface and access features. The overall direction is the same as before, but we moved on to the next piece of the whole.

Before opening public access to our beta software, we need to be sure that the data is secure. We have been working on an access rights filter and role-based privileges since the pre-alpha versions. The filter is based on an external LDAP directory and is embedded in our software code itself. Access rights are indexed together with the data to avoid performance bottlenecks. We are also looking into adding support for external security software. These will be discussed more this spring.
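
One common way to use access rights that are indexed with the data is to turn the user’s roles into a Solr filter query. The Python sketch below illustrates that idea only. It is not the OSA filter, and the field name `access_roles` and the `public` role are assumptions.

```python
def quote_term(term):
    """Quote a value as a Solr phrase, escaping backslashes and quotes."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def access_filter_query(user_roles, field="access_roles", public_role="public"):
    """Build a filter query matching documents the user may see.

    user_roles would come from the directory (for example LDAP groups).
    Everyone, including anonymous users, gets the public role.
    """
    roles = sorted(set(user_roles) | {public_role})
    clauses = " OR ".join(quote_term(role) for role in roles)
    return f"{field}:({clauses})"
```

Passing the result as an `fq` parameter lets the search engine drop inaccessible documents itself, so the application does not have to check each hit separately.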

 

The digital workspace is a place where you can inspect, filter and enrich files before ingesting them. It can be used to trigger workflows and manage automated ingests. Handling lots of files requires a clear interface and a good way to summarize the content. Our goal is to build a pre-archive workspace (if that is a term). It should be easy to filter, sort and group files to quickly decide what needs to be archived from an external hard disk or a thumb drive, for instance. The workspace can also be used to monitor an FTP upload directory or a network directory. Later, we will add the ability to ingest batch metadata from an Excel, CSV or XML file. For now, the goal is to create a unified user interface and add a very simple ingest workflow. During the pilot tests we will expand the functionality.
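
The filter, sort and group operations described above could look roughly like this Python sketch. The input format (a list of name and size pairs) and the grouping by file extension are assumptions made for illustration and are not how the OSA workspace is implemented.

```python
from collections import defaultdict
from pathlib import PurePath


def summarize_workspace(files, min_size=0):
    """Group workspace files by extension, largest files first.

    files is an iterable of (name, size_in_bytes) pairs. Files smaller
    than min_size are filtered out.
    """
    groups = defaultdict(list)
    for name, size in files:
        if size < min_size:
            continue
        extension = PurePath(name).suffix.lower() or "(none)"
        groups[extension].append((name, size))
    return {
        extension: sorted(items, key=lambda item: item[1], reverse=True)
        for extension, items in sorted(groups.items())
    }
```

A summary like this gives a quick overview of what is on a thumb drive before deciding what to archive.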

Now that we have the Solr schemas defined, we need to create an integration between Solr and Fedora Commons. Currently, we have GSearch and Apache ActiveMQ operating as middleware. Fedora sends messages about changes to XML and other files (datastreams) to a message queue. GSearch listens to the queue, transforms the changed content with XSLT and sends it to Solr’s REST interface. This will change once Fedora 4 is in stable beta and we can start working with it. The Solr schemas and messaging will stay, but we are looking to replace GSearch with something simpler and more flexible.
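
To show what the transformation step amounts to, here is a minimal Python sketch that turns a Dublin Core datastream into a Solr document. It uses plain Python in place of the XSLT that GSearch applies, and the `dc_` field naming is an assumption rather than our actual schema.

```python
import json
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set, in ElementTree notation.
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def datastream_to_solr_doc(pid, dc_xml):
    """Map every non-empty Dublin Core element to a multi-valued field."""
    root = ET.fromstring(dc_xml)
    doc = {"id": pid}
    for element in root.iter():
        if element.tag.startswith(DC_NS) and element.text and element.text.strip():
            field = "dc_" + element.tag[len(DC_NS):]
            doc.setdefault(field, []).append(element.text.strip())
    return doc


def to_update_payload(docs):
    """Serialize documents as a JSON list, the form Solr accepts for updates."""
    return json.dumps(docs)
```

A replacement for GSearch would wrap a function like this in a message queue listener and post the payload to Solr.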

Still, there is lots of work to be done before the beta.

Mikko

Posted in Fedora Commons, Software development

Development news: user interface, usability, search and lots of iterations

First I would like to introduce a change to the posting schedule. I’ll try to update the blog at the start of the week instead of my traditional Friday post. It would serve the project better to open a week with a fresh post and introduce some ideas rather than try to catch the attention of people oriented for the weekend.

To recap last week, we continued in the same direction as before. Lots of effort was put into the search and indexing features. We now have an understanding of how Solr handles field types for faceting and sorting and what kind of schema is efficient. We ended up sharding the metadata and the rich text contents separately: most of the time preserved documents don’t change, whereas metadata can be added and enriched over time. The user interface received improvements as well. Lots of content is loaded and saved with AJAX to minimize loading time and keep the experience uninterrupted. Users can keep on working while actions are performed on the server, and only the status is updated on the screen. These are quite basic features in web applications, but the lack of them can severely harm the user experience and make the software unpleasant to use. Another UI improvement was the consolidation of multiple pages or features into intuitive entities.

As the software is built primarily for Finnish use, we integrated the Standard Industrial Classification (2008) by Statistics Finland and another classification by the Finnish Business Archive Association. They are imported as read-only contextual objects and can be used to describe the content of any organization using the system.

We continued iterating on the code quality and features of the work done during the alpha stages. Lots of work still needs to be done before we reach a sustainable level. But by proceeding this way, we have months of feedback and testing of the core features. I think that building finished features from the start would have taken considerably more time and would in the end have resulted in worse quality. At least there would have been a major risk of building something not needed and prioritizing the wrong features.

Open Repositories 2014 is coming to Helsinki this year. We will be there on the Fedora tech track. Most likely I will be sharing experiences and introducing the project and the latest developments.

Mikko

Posted in Fedora Commons, Software development

Solr, data and minor updates

The week has been busy but, as promised, here are some insights into the latest development. The main focus has been data: Solr indexing and configuration, data stores, user interfaces for searching the data and other supporting features.

Understanding the language in which documents and metadata are written is very important, so we had to teach Solr some Finnish. Solr handles widely used languages like English with ease, but Finnish requires a dictionary and complex rules. We already had some experience of developing search features with Solr and of how real production data behaves. When we started experimenting with Solr, there were no open and free tools for the Finnish language. Voikko existed, of course, but not under a license we could use in commercial services and on our terms. With the new licensing and the work done on the Solr Voikko plugin by the National Library’s KDK project, we upgraded Solr’s language understanding to the next level. Voikko is the same language tool that is used for OpenOffice’s Finnish features.

I’ve been learning how to build more optimized Solr schemas and configurations. At first we went with highly dynamic field declarations to avoid rebuilding indexes and making the system too rigid, but it now seems that this has a negative effect on performance and on certain key features such as sorting and faceting by field. Our experiences and production data provide a good starting point, but tuning the settings and finding the best analyzers, field types and such is still a task not to be rushed or taken too lightly. Again, KDK provides a nice reference with their public schemas (available on GitHub). We will publish our own schemas and configurations with the software and contribute to the Voikko Solr plugin if we need to modify it.

At the moment, we are still using Fedora 3.x and GSearch to feed Solr. GSearch takes the messages sent by Fedora and transforms them into Solr documents for its REST interface. During the spring or early summer, we hope to migrate to Fedora 4, which eliminates the need for GSearch and simplifies the setup. As for other data stores, RDF databases and engines look very interesting. Fedora 3 ships with Mulgara, and we will use it until the migration. Apache Jena looks like an interesting alternative, but we are still in discovery mode with that.

The search user interface is becoming more and more polished. We are working on faceting at the moment. Many of the features, as well as the overall look and feel, come from benchmarking and from active partners and their users. Bug fixes and other small improvements are made on a daily basis. We embedded the official standard industrial classification by Statistics Finland to speed up the ingest and description process.
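
For readers unfamiliar with faceting, the Python sketch below builds the query string for a faceted Solr search. The parameters (`q`, `rows`, `facet`, `facet.field`, `facet.mincount`, `fq`, `sort`) are standard Solr request parameters, but the function and the field names used with it are hypothetical.

```python
from urllib.parse import urlencode


def build_search_params(text, facet_fields, selected=None, sort=None, rows=20):
    """Build the query string for a faceted Solr search.

    selected maps a facet field to the value the user clicked, which
    becomes a filter query so that only matching documents are returned.
    """
    params = [
        ("q", text or "*:*"),
        ("rows", rows),
        ("facet", "true"),
        ("facet.mincount", 1),
    ]
    params += [("facet.field", field) for field in facet_fields]
    for field, value in (selected or {}).items():
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{field}:"{escaped}"'))
    if sort:
        params.append(("sort", sort))
    return urlencode(params)
```

Each selected facet value becomes its own `fq` parameter, while the `facet.field` entries ask Solr to return value counts for the remaining results.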

More updates coming next week.

Mikko

Posted in Data management, Fedora Commons, Software development