Beta released

Open Source Archive beta was published yesterday. You can visit it at https://osa.mamk.fi/. The default interface lets you search and inspect the public content (which is currently very limited). Once logged in, you can access the full feature set for information managers, researchers and archive administrators. If you would like a test account, just drop me an email or leave your details in a comment. In the future, we will also open public registration for test accounts. The final software will be released as open source, and it will also be provided on a SaaS model if you prefer a turnkey solution.

Development so far has emphasized the core features (access rights, archive management, ingest, description, search and indexing) and, of course, the user interface. Next on the roadmap are multiple pilot cases in which we will develop more advanced features and solutions to specific problems. A few highlights: batch and automated ingests, distributed workflows, and discovering and visualizing metadata and other content for end users (both researchers and non-researchers).

The project personnel and the control group were both satisfied; it has taken a lot of work to get this far. Still, the software is very much in development, and we will keep making it better, more pleasant and loaded with useful features. We would be grateful for all feedback and critical comments, and we take both the feedback and the people behind it very seriously. Weekly updates will be released to fix whatever inconveniences and bugs turn up in the first beta. We don't plan to release perfect software by designing it all ourselves; that would be impossible. We learn and iterate, so each release will be better.

Mikko

Less than a week to beta

It's almost time to launch the beta: it goes live next Tuesday. It's a milestone, but since we practice near-continuous integration and can publish minor versions multiple times a week, it's not that drastic a change. The difference from the previous alpha is huge, though. The beta will most likely evolve a lot during its first weeks. I will write more about the launch next week. At that point, we will also start planning the first pilot cases and begin implementing them as soon as possible.

There was a major milestone even before the beta launch: we got our local DAITSS installation working. DAITSS, developed by the Florida Center for Library Automation (FCLA), is our choice for the dark archive. Finding the right components and compatible software versions for DAITSS was quite a tricky task, but the system should be very stable once installed. Next we will ingest data into it and try it in action. If all goes well, we can mirror and distribute the system to keep the data even safer.

The latest development has covered unit testing and improving the access rights system and the user interface. Building a solid interface has been really time consuming even with the feedback from our test users: many assumptions made by developers turn out to be slightly off, and fine-tuning each detail takes time. The key is to identify and model the use cases and processes well. Luckily, some great thesis works on designing the user experience for digital archives were completed last year, and one more will be finished this spring. If I could change one thing, I would start the interface design even earlier.

I will keep this posting short. More time for beta development, less time to write about it.

Mikko

Data management and upcoming conferences

First, we would like to announce that we will be participating in the Archiving 2014 conference in Berlin in May. The OSA project has two papers: Flexible Data Model for Linked Objects in Digital Archives by Mikko Lampi (MAMK) and Olli Alm (ELKA – Central Archives of Finnish Business Records), and Micro-services Based Distributable Workflow for Digital Archives by Heikki Kurhinen (MAMK/Otavan Opisto) and Mikko Lampi (MAMK). The first paper covers the data model designed during the Capture project; the model is implemented in the OSA software and will be developed further until the end of the project. The main technologies behind the model are the Fedora object model and RDF. The workflow paper covers the software Heikki developed as his bachelor's thesis; it is designed to be very simple and easy to integrate with any software.
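
To give a concrete feel for linked objects at the Fedora level, here is a minimal sketch: relationships live in an object's RELS-EXT datastream as RDF. Only the Fedora namespace below is standard; the osa: property names and PIDs are made up for illustration, not the actual vocabulary from the paper.

```xml
<!-- RELS-EXT of a hypothetical record object osa:123. The osa: terms and
     PIDs are illustrative only; solely the Fedora namespace is standard. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:fedora="info:fedora/fedora-system:def/relations-external#"
         xmlns:osa="http://example.org/osa#">
  <rdf:Description rdf:about="info:fedora/osa:123">
    <fedora:isMemberOfCollection rdf:resource="info:fedora/osa:collection-1"/>
    <osa:describedBy rdf:resource="info:fedora/osa:context-sic2008"/>
  </rdf:Description>
</rdf:RDF>
```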

Here is the latest development news. We started implementing mass modification features for managing object metadata before and after ingest. Mass updating is very useful for describing batches of files before ingesting them: adding common metadata about the owner or the origin can help the ingest and management processes. We found, however, that modifying descriptive metadata after the initial ingest is not that simple. Therefore, in the upcoming beta version, mass updates are available only for files in the workspace before ingest. We will continue developing mass updates for archived content during the spring.

Much of the Fedora content models have been refactored to contain only the absolute minimum data required to understand and manage the objects. Information about forms, organization-specific mappings and the like was removed from the archive, because those are only views or interpretations of the data, not the data definition itself. This makes the design more consistent and allows organizations and users much better customization.

After the last post, we found out that the current GSearch release didn't support the latest Solr, which we require because of the Finnish language support. This has since been resolved with a new GSearch release. Because our Solr and Fedora are installed on separate servers, we cannot use GSearch's reindex functionality. I did some initial testing with SSHFS and NFS for connecting the servers, but that approach is not very sustainable, and network or server errors could leave the index out of sync. We will develop a module to keep track of the sync status and perform reindexing as needed.
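
A minimal sketch of what such a module could do, assuming a GSearch-style index that stores each object's PID and Fedora modification date (the field names, URLs and core name are placeholders, not our final design):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class IndexSyncChecker {
    // Assumed field names; a GSearch-fed index typically carries the PID
    // and the object's last modification date alongside the content.
    static final String PID_FIELD = "PID";
    static final String MODIFIED_FIELD = "fgs.lastModifiedDate";

    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://solr.example.org:8983/solr/osa").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setFields(PID_FIELD, MODIFIED_FIELD);
            q.setRows(1000); // a real module would paginate the results
            for (SolrDocument doc : solr.query(q).getResults()) {
                String pid = (String) doc.getFirstValue(PID_FIELD);
                Object indexedAt = doc.getFirstValue(MODIFIED_FIELD);
                // Compare indexedAt against Fedora's object profile
                // (GET /fedora/objects/{pid}?format=xml) and queue the
                // PID for reindexing whenever Fedora is newer.
                System.out.printf("%s indexed at %s%n", pid, indexedAt);
            }
        }
    }
}
```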

Mikko

Development news: digital workspace, access rights and other improvements

It's time for a brief weekly recap. The general focus was on the user interface and access features: the overall direction was the same as before, but we moved on to the next piece of the whole.

Before opening public access to our beta software, we need to be sure the data is secure. We have been working on an access rights filter and role-based privileges since the pre-alpha versions. The system is based on an external LDAP, is embedded in our software code itself, and is indexed together with the data to avoid performance bottlenecks. We are also looking into adding support for external security software. These topics will be discussed more this spring.
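
Since the access information is indexed with the data, a natural way to enforce it at query time is a filter query built from the user's LDAP roles. A minimal sketch, assuming a hypothetical multi-valued access_roles field (the real schema differs):

```java
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;

public class AccessFilter {
    // "access_roles" is a hypothetical multi-valued field written at
    // index time; it lists the roles allowed to see each document.
    public static SolrQuery withRoleFilter(SolrQuery query, List<String> roles) {
        // Builds e.g. access_roles:("researcher" OR "admin"). Escape the
        // values properly in production to avoid query injection.
        String clause = roles.stream()
                .map(r -> "\"" + r + "\"")
                .reduce((a, b) -> a + " OR " + b)
                .orElse("\"public\"");
        query.addFilterQuery("access_roles:(" + clause + ")");
        return query;
    }
}
```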

The digital workspace is a place where you can inspect, filter and enrich files before ingesting them. It can be used to trigger workflows and manage automated ingests. Handling lots of files requires a clear interface and a good way to summarize the content. Our goal is to build a pre-archive workspace (if that is a term): it should be easy to filter, sort and group files to quickly decide what needs to be archived from an external hard disk or a thumb drive, for instance. The workspace can also monitor an FTP upload directory or a network directory. Later, we will add the ability to ingest batch metadata from an Excel, CSV or XML file. For now the goal is to create a unified user interface and a very simple ingest workflow; during the pilot tests we will expand the functionality.
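
For the directory monitoring part, a minimal sketch with the JDK's WatchService (the path and the handoff to the workspace are placeholders):

```java
import java.nio.file.*;

public class UploadDirectoryMonitor {
    public static void main(String[] args) throws Exception {
        Path uploads = Paths.get("/data/ftp-uploads"); // placeholder path
        WatchService watcher = FileSystems.getDefault().newWatchService();
        uploads.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
        while (true) {
            WatchKey key = watcher.take(); // blocks until something arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                Path file = uploads.resolve((Path) event.context());
                // Hand the new file to the workspace for inspection,
                // filtering, grouping and (possibly automated) ingest.
                System.out.println("New upload: " + file);
            }
            key.reset(); // re-arm the key for further events
        }
    }
}
```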

Now that we have the Solr schemas defined, we need an integration between Solr and Fedora Commons. Currently, GSearch and Apache ActiveMQ operate as middleware: GSearch listens to a message queue where Fedora publishes change notifications for XML and other files (datastreams), transforms them with XSLT and sends them to Solr's REST interface. This will change once Fedora 4 reaches a stable beta and we can start working with it. The Solr schemas and the messaging will stay, but we are looking to replace GSearch with something simpler and more flexible.
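
To give a feel for the middleware, here is a hedged sketch of a consumer in GSearch's position. Fedora 3 publishes change notifications to an ActiveMQ topic (typically fedora.apim.update); the broker URL and the transform step described in the comments are assumptions about a typical setup, not our production configuration.

```java
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class FedoraUpdateListener {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://fedora.example.org:61616");
        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        // Fedora 3 sends API-M change notifications to this topic.
        Topic topic = session.createTopic("fedora.apim.update");
        MessageConsumer consumer = session.createConsumer(topic);
        consumer.setMessageListener(message -> {
            try {
                String atom = ((TextMessage) message).getText();
                // GSearch applies XSLT here to build a Solr document and
                // POSTs it to Solr's update endpoint; a replacement would
                // do the same transform with something lighter.
                System.out.println("Change event: " + atom);
            } catch (JMSException e) {
                e.printStackTrace();
            }
        });
        connection.start(); // begin receiving; keep the JVM alive in a service
    }
}
```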

Still, there is lots of work to be done before the beta.

Mikko

Development news: user interface, usability, search and lots of iterations

First, I would like to introduce a change to the posting schedule: I'll try to update the blog at the start of the week instead of my traditional Friday post. It serves the project better to open the week with a fresh post and introduce some ideas than to try to catch the attention of people already oriented toward the weekend.

To recap, last week we continued in the same direction as before. A lot of effort went into search and indexing features. We now understand how Solr handles field types for faceting and sorting, and what kind of schema is efficient. We ended up sharding the metadata and the rich text contents separately: preserved documents mostly don't change, while metadata, on the other hand, can be added and enriched over time. The user interface received improvements as well. Lots of content is loaded and saved with Ajax to minimize loading times and keep the experience seamless: users can keep working while actions are performed on the server, and only the status is updated on the screen. These are quite basic features in web applications, but their absence can severely harm the user experience and make the software unpleasant to use. Another UI improvement was consolidating multiple pages and features into intuitive entities.
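
To illustrate the field-type split (a hedged sketch, not our actual schema): analyzed text fields serve full-text search, while faceting and sorting read from untokenized copies.

```xml
<!-- schema.xml fragment; field names are illustrative. text_fi is assumed
     to be an analyzed Finnish field type defined elsewhere in the schema. -->
<field name="title"       type="text_fi" indexed="true" stored="true"/>
<!-- Untokenized copy for faceting and sorting; docValues (Solr 4.2+)
     keeps these operations efficient. -->
<field name="title_facet" type="string"  indexed="true" stored="false"
       docValues="true"/>
<copyField source="title" dest="title_facet"/>
```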

As the software is built first and foremost for Finnish use, we integrated the Standard Industrial Classification (2008) by Statistics Finland and another classification by the Finnish Business Archive Association. They are imported as read-only contextual objects and can be used to describe the content of any organization using the system.

We continued iterating to improve the code quality and the features developed during the alpha stages. Lots of work remains before we reach a sustainable level, but by proceeding this way we have months of feedback and testing on the core features. I think building the features to a finished state up front would have taken considerably more time and would, in the end, have resulted in worse quality. At the very least, there would have been a major risk of building something not needed and prioritizing the wrong features.

Open Repositories 2014 is coming to Helsinki this year. We will be there on the Fedora tech track. Most likely I will be sharing experiences and introducing the project and the latest developments.

Mikko

Solr, data and minor updates

The week has been busy, but as promised, here are some insights into the latest development. The main focus has been on data: Solr indexing and configuration, data stores, user interfaces for searching the data, and other supporting features.

Understanding the language in which documents and metadata are written is very important, so we had to teach Solr some Finnish. Solr handles widely used languages like English with ease, but Finnish requires a dictionary and complex rules. We already had some experience in developing search features with Solr and in how real production data behaves. When we started experimenting with Solr, there were no open and free tools for the Finnish language; Voikko existed, but not under a license we could use in commercial services and on our terms. With the new licensing and the work done on the Solr Voikko plugin by the National Library's KDK project, we upgraded Solr's understanding of the language to the next level. Voikko is the same language tool used for OpenOffice's Finnish features.

I've been learning how to build more optimized Solr schemas and configurations. At first we went with highly dynamic declarations to avoid rebuilding indexes and making the system too rigid, but it now seems that this hurts performance and certain key features, such as sorting and faceting by field. Our experience and production data provide a good starting point, but tuning the settings and finding the best analyzers, field types and the like is a task not to be rushed or taken too lightly. Again, KDK provides a nice reference with their public schemas (available on GitHub). We will publish our own schemas and configurations with the software, and we will contribute to the Voikko Solr plugin if we need to modify it.
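
As a sketch of what a Finnish-aware field type can look like in schema.xml (the Voikko filter class name below is a placeholder; check the plugin's own documentation for the real one):

```xml
<!-- Illustrative Finnish field type; the Voikko filter class name is a
     placeholder, not the actual class shipped with the plugin. -->
<fieldType name="text_fi" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="example.solr.VoikkoFilterFactory"/>
  </analyzer>
</fieldType>
```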

At the moment, we are still using Fedora 3.x with GSearch to feed Solr: GSearch takes the messages sent by Fedora and transforms them into Solr documents for Solr's REST interface. During the spring or early summer, we hope to migrate to Fedora 4, which eliminates the need for GSearch and simplifies the setup. For the other data stores, RDF databases and engines look very interesting. Fedora 3 ships with Mulgara, and we will use it until the migration; Apache Jena looks like an interesting alternative, but we are still in discovery mode there.
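
For context on what the RDF side already gives us: Fedora 3's Resource Index, backed by Mulgara, can be queried through the risearch endpoint. A hedged example that lists the members of a made-up collection:

```sparql
# Sent to /fedora/risearch?type=tuples&lang=sparql; the collection PID
# is illustrative.
PREFIX fedora: <info:fedora/fedora-system:def/relations-external#>
SELECT ?member WHERE {
  ?member fedora:isMemberOfCollection <info:fedora/osa:collection-1> .
}
```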

The search user interface is becoming more and more polished. We are working on faceting at the moment. Many of the features, as well as the overall look and feel, come from benchmarking and from active partners and their users. Bug fixes and other small improvements are made on a daily basis. We also embedded the official Standard Industrial Classification by Statistics Finland to speed up the ingest and description process.

More updates coming next week.

Mikko

From alpha to beta

With the year 2014, winter has finally come to Finland. In addition, we have development running at full speed. Our target is to hit the beta release at the beginning of March. The roadmap is a combination of user feedback from the earlier agile sprints and specifications made during the Capture project. The main focus of the beta sprint is on the user interface, search and indexing, and the ingest and management of data. All the core features have been in place since the alpha release, but using them required some knowledge and a good understanding of how things are in the early stages of software development.

I will post some previews of the user interface later. The interface will be in English and Finnish, and it can be localized and customized easily to support your desired language, regional preferences and organizational policies. Effort is also being put into making the interface user friendly: since the early pre-alpha versions, we have iterated the user interface design with the designated users. Stay tuned for interface mockups or even screenshots.

Search and indexing will be implemented mainly with Solr, the de facto open source engine for this kind of content. Finna (a nice user interface for accessing Finnish archives, libraries and museums) also uses Solr. Finnish has traditionally been a somewhat tricky language, but good support is now available and we can benefit from the open source work already done. We have a working integration between our data repository (Fedora Commons) and Solr, but we still need to work on indexing metadata and rich text contents and on building the end-user features. We will build a new kind of search and browse interface based on faceting and visualizing the data, instead of just an empty search box waiting for you to know what to type. Of course, the Google-like search box will be there too if you prefer it.
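
As a sketch of the facet-first idea (field names are hypothetical): the interface can open with facet counts for users to drill into before they type any search terms.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;

public class FacetedBrowse {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://solr.example.org:8983/solr/osa").build()) {
            SolrQuery q = new SolrQuery("*:*"); // no search terms required
            q.setRows(0); // we only want the facet counts
            q.addFacetField("organization_facet", "year_facet", "doctype_facet");
            for (FacetField facet : solr.query(q).getFacetFields()) {
                System.out.println(facet.getName());
                facet.getValues().forEach(c ->
                        System.out.printf("  %s (%d)%n", c.getName(), c.getCount()));
            }
        }
    }
}
```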

The third large sprint readies the tools for ingesting and managing the archive data. We will build a workspace that contains pre-ingest and discovery tools, manual ingest, workflows and batch ingest, as well as management of the existing data. Again, we will make it highly configurable and suitable for ingesting data of any kind, not just documents or strictly formatted metadata. For the beta and pilot phases we will introduce only reference features, but they can be extended and will be well documented.

Lastly, the simple workflow engine is now complete and will be published as a standalone project on GitHub later this year. The developer, Heikki Kurhinen, wrote it for the project as his thesis work. The engine will be featured in a later article as soon as we get it integrated with our main software.

I will write more on the topics introduced as our team progresses.

Mikko
