Summary of 2013 first half

This year has been full of development and fast progress. Unfortunately, it has meant less time to write on the blog. In this post, I’ll brief you on the current status and latest news.

Open source in content management seminar
Last week we had a very interesting seminar about open source, both in general and from the point of view of memory organizations. Our keynote speaker was Michael ‘Monty’ Widenius, who talked about open source and business. MySQL is an excellent example of how to make open source worth your while. The key points of his speech were community, patience, and understanding the ecosystem and the users. As Monty said, open source suits human nature well: in a nutshell, you work and pay less while getting more value. But there is also a difference between free as in beer and free as in speech. Open source is the sustainable way of operating in our business (among others), and what Monty said was in line with our views.

Other speakers included Mikael Vakkari, a representative of the Ministry of Finance, who told us how the European Union and the government regulate and encourage the public sector and administration to use open source. The direction is right, but it takes time to get there. We also heard about experiences of building open source communities from our local partner Otavan Opisto. They confirmed what Monty had said: it takes time to get people involved, and their involvement usually can’t be taken for granted. A common method is to appoint a community manager to inspire people and bring them together, as well as to coordinate the operations.

The director of Brages Pressarkiv, Jessica Parland-von Essen, emphasized that open source is not only about technology and software but also about trust and knowledge. Anything that is open and public can be inspected and reviewed. People tend to distrust secrets and the unknown. Open source, on the other hand, helps to make things like licensing understandable. Anyone can (at least in theory) see how an open system works. This does not mean that the data should be open or that the system should be accessible to the public. A good example is the Linux operating system: while the system itself is open, your personal data is yours only.

Finally, the seminar concluded with a panel discussion on the future of memory organizations, the democratization of the Internet, and whether the archives should adapt. It was an interesting topic, and it was quite clear that some changes will happen in the near future.

Theses
A few theses have been completed or are in progress on the topics OSA covers. They were introduced in the seminar mentioned above. Two of them studied the usability, service design and user interfaces of the light archive system. Tytti Vuorikari studied the service concepts, service blueprints and various user profiles, and Outi Hilola’s work concentrated on data visualization and the user interfaces for finding and processing information from the archives.

This kind of research and field study is a necessity when using an agile approach. These theses helped us to understand the needs and to focus the development work on the areas requiring the most attention. Methods included interviewing, tailing, benchmarking and organizing workshops. The results were then analyzed, restructured and described. Benchmarking covered some related projects, such as the National Digital Library’s Finna, which is also available in English at https://www.finna.fi/

Some technology-oriented theses are in progress. I will write more about them in the near future. The topics include building a complete virtualized private cloud to provide either platform as a service or infrastructure as a service.

OSA roadmap
The estimated roadmap of the project milestones is presented below.

[Figure: OSA roadmap]

We are currently about to release the first cycle of development builds. There will be a limited release for our project partners. Starting from the alpha or beta release, we could provide access to others as well. As this is an open source project, the code and other products will be available on request and eventually on GitHub or the like.

Currently, we have all the essential software in place. These include Fedora Commons, Solr with Fedora integration, MariaDB, LDAP and a web application to wrap them together and to provide easy access. We also have a rough user interface and initial services for ingesting and accessing objects. During the summer, we hope to advance very rapidly. The initial plan is to release a new development build for testing each week and work incrementally until we reach the alpha version at the end of August 2013 and the beta at the end of the year. At that time, we should be pilot ready.

Mikko Lampi


Seminar: Open Source in Content Management

Our project will organise an open seminar on 7th June 2013 at the Main Campus of Mikkeli University of Applied Sciences in Mikkeli. The seminar, Open Source in Content Management, will open at 11.15, and the venue is Mikpoli hall. For our international readers, I have to confess that the seminar will be in Finnish. We are considering recording the seminar and streaming the speeches with the help of the digital archive services of MUAS. If we do, we will let you know later.
The programme consists of a keynote speech by Michael ‘Monty’ Widenius, the founder of MySQL and MariaDB. Mikael Vakkari of the Ministry of Finance will explain the Open Source policy and practice in the Finnish public sector, and Antti Leppä of Otavan Opisto will talk about their long history of and experiences with using Open Source in distance and blended learning, for example.
The second theme is called Open Source and Archives. Dr. Jessica Parland-von Essen, Chief of Archives at Brages Pressarkiv, will first talk about the expectations the archives have for Open Source solutions. The second speaker, Osmo Palonen, Project Manager at MUAS, will talk about the ideas behind the Project OSA. The rest of this theme is connected with the activities done and planned in the project: Mikko Lampi, Lead Engineer of the project, will report on the studies made in the project, most of which have been reported in this blog, and also give an overview of the architecture and the Open Source solutions to be built in this project. The students and their teachers will also have an opportunity to briefly present the bachelor’s theses that are connected with the Project OSA.
At the end of the seminar, there will be a discussion about the openness of archives and expertise. In it, we would like to challenge the memory organisations to find out how the world is changing as the status of archivists and museum specialists as gatekeepers of information fades away.
The programme of the seminar in Finnish can be found on the MUAS website http://www.mamk.fi/ajankohtaista/tapahtumat/101/0/avoin_lahdekoohdi_aineistonhallinnassa_-seminaari_7_6.


Selecting an open source framework for web application

We have been choosing the framework for the OSA web application. We had a few requirements: we want to use Java, it must be an open source application development framework, and the framework should be scalable and easy to configure (no complex XML configuration).

There are several easy-to-use Java web frameworks. We took Stripes, Vaadin and Liferay into consideration. The idea of studying Vaadin and Liferay was to check whether we could get the benefit of building a UI quickly. Struts and Spring-MVC were not considered, since they require a lot of configuration.

Liferay is an open source portal for building websites and web applications. It is described as a content management framework, and it integrates well with many Java systems. Liferay provides portlets and gadgets for developing applications. For me, Liferay was difficult to get started with. Liferay comes in two editions: the free Community Edition and the Enterprise Edition (EE). EE is reported to be better optimized (it performs better), and its users can report bugs and get them fixed. There are also extra EE-only features available.

Vaadin is an open source Java framework for building Rich Internet Applications. When working with the Vaadin framework, it is possible to use Java as the only programming language without having to write HTML and CSS. Vaadin provides a library of ready-to-use user interface components. It has nice themes and a set of components precompiled in GWT. It provides a clean framework for creating your own components as well. If we chose Vaadin, we would get a nice-looking UI easily. However, I found the Vaadin documentation lacking: there is a good ‘Book of Vaadin’, but that’s it.

Stripes is an open source web development framework with just a few dependencies, and it is based on an action-based MVC pattern. I found Stripes easy to learn. Configuration is needed only in web.xml. There is also a lot of documentation available.

We ended up starting web development with Stripes. This decision was based on the fact that we wanted to select a native web framework. We have a lot of experience with Java Stripes development in this project. By selecting Stripes, we will have control over all functionality. We can also reuse some old code written in earlier projects.


Building a dark archive

One of the primary goals in our project is to build a dark archive. The motivation for this is the preservation of digital material. A dark archive is a repository whose sole purpose is to preserve the data, not to provide services or general access. It is like a black box where you put stuff and know that it will be safe and sound. Of course, limited access is provided for the archive personnel, but we’re not going to add any public interfaces or internet access.

We have chosen to build our dark archive with the DAITSS digital preservation repository software (http://daitss.fcla.edu/). It is a production-ready open source system built by the Florida Center for Library Automation (FCLA). The reasons for picking DAITSS were its technical architecture, production readiness and other design principles. As stated on the DAITSS website, it offers:

  • Automated support for submission, ingest, archival storage, withdrawal and repository management
  • REST and micro-services based architecture
  • Enforced control for data integrity and authenticity
  • Preservation strategy and implementation based on format identification and characterization
  • Good support for text, document, image, audio and video

You can read more about DAITSS and its features here: http://daitss.fcla.edu/content/documentation

We performed initial testing with the available virtual machine during summer and fall 2012. Our experiences support these statements, and the documentation and community are top-notch. We made the decision to use DAITSS as is and to make changes only through configuration. This way we maintain full compatibility and don’t need the technical resources required for compiling and modifying the system. We can also put more effort into other objectives like testing distributed and replaceable servers.

The system will be replicated to two geographically remote locations. We will evaluate and develop tools and methods for managing the data transport and security. It is clear that we cannot use the Internet for this kind of communication. We still need to decide the levels and roles of the distribution: DAITSS offers storage pooling, there are options in Linux, and the data could be replicated by our backend storage infrastructure as well.
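Whichever replication layer is chosen, the copies at the remote sites need to be checked against each other. A minimal sketch of such a fixity check in Python follows. It assumes each replica is reachable as a local file path, and the helper names `sha256_of` and `replicas_match` are hypothetical: they are not part of DAITSS or of our tooling.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large packages never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicas_match(paths):
    """True when every replica has the same checksum (hypothetical helper)."""
    digests = {sha256_of(path) for path in paths}
    return len(digests) == 1
```

Calling `replicas_match` with the paths of the same package at each site would flag any copy that has drifted. A real setup would also compare against the checksum recorded at ingest.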

The backend storage for archived data will be tape. The disks are used only as temporary storage during ingest, migrations and such. The data is stored for much longer periods of time than the life cycle of the server hardware. The specifications for the hardware are also quite different from those of a typical application server. We try to adapt by building our own servers, as Google, Facebook, Rackspace and some others do. This way we achieve a lower-cost, easily expandable and replaceable environment and, all in all, control over the hardware. Another issue worth noting is that not all enterprise hardware is compatible with all Linux distributions. We don’t want to limit our options by picking a corporate expansion card.

We bought components for two identical servers. Both have redundant power supplies, network adapters and all the basic components, and in addition have ten ordinary hard disks attached to a disk controller. If one breaks, you could walk into the PC store next door and buy a replacement. Expanding the farm with additional servers is not expensive either. We will experiment with these servers and monitor the load, reliability and such to find out how we can optimize the solution. There must be a reason why this is an emerging trend.

CentOS 6 will be used as the operating system. It is a robust server Linux distribution and works well with DAITSS. It could be swapped for Red Hat Enterprise Linux if we needed commercial support. With this kind of usage, we don’t want a fast-changing distribution with all the bells and whistles. DAITSS requires a relational database for system data (not the archived data). MariaDB is a good pick, because it continues the work done with MySQL but with a completely open source future. Essentially, it is a drop-in replacement with better features.

At this point, tape management and access have to be done with proprietary solutions. Maybe this will change in the future, but until then we will use IBM Tivoli. At Mikkeli University of Applied Sciences, we have used it for years in production systems.

I will keep you updated on the developments and progress as we go.

Mikko Lampi


Open Source – Open Development

Dear Readers,

My apologies for being quiet for so long, but there have been so many other personal developments going on. At the end of the year, I moved from the city to a village and have been specialising more in renovating a house than in open source. Actually, I wrote the following text at the beginning of December, but my colleague Mikko criticised it for being unclear. I have finally had time to edit the text, and I am posting it with some updates.

Some months ago, I attended a seminar called OPEN (AVOIN in Finnish). The event was organised by SADe, the Action Programme on eServices and eDemocracy (http://www.vm.fi/vm/en/05_projects/03_sade/index.jsp). The programme is part of the public sector ICT development led by the Ministry of Finance in Finland. The seminar promoted the eight member projects, which use Open Source principles as well as the Open Data initiative.

Most of the eight projects are fairly regular developments within the public administration and are organised traditionally. Nevertheless, some cases aroused my interest. The Learner’s Service Framework will collect all certificates and diplomas, as well as study plans, into an individual’s Study Path, covering compulsory education, vocational and higher education, and continuing education years and decades later. The Framework will also be used as a tool for creating applications to universities. When the service is ready, citizens will also be able to provide the information to employers in electronic format.

The project itself was not a completely new idea: years ago, I tried to get funding for a similar one. Still, the project is worth another look. It uses agile software development, where one of the key issues is getting the users and the software developers to work together. In this case, the participants work in the same rooms to ensure that feedback from the users is immediately available to the developers. The project uses the Scrum methodology.

For me, the most important message was that finally someone in the public sector has understood the need for a new method and is able to demand it. Software development must not be a separate task carried out far from the end customers and users. Instead, software and IT are just a part of service and process development, which should be open to the users and hopefully also to those who will ultimately use the service, in this case the students.

Open Source Archive has to get in touch and discuss the requirements with all stakeholders: the archivists know how to manage information, and the users (researchers and the public) know how the material can be utilized.


Blueprinting Service Oriented Archive

In this post, I will present the blueprint for our service oriented archive. I will call it a light archive, because I hope the name will create some contrast between this system and a dark archive, which I’m going to talk about in later posts this spring.

A light archive is all about providing various users and groups with value and services, not just the raw archived data or records. Common services include search and discover features, data ingest, access and distribution of metadata, multimedia streams, downloadable copies and such. To achieve this the system itself must be like a modular ecosystem of evolving software, where each piece contributes to the system. A modular system can be upgraded and developed piece by piece, new features can be added and obsolete interfaces reworked, among other advantages. It’s not reasonable to develop the whole system in-house, and of course there is absolutely no point in reinventing the wheel. Reusing and integrating software, and contributing it back to the community creates an ecosystem that can produce high quality software. That is how we think our light archive should be built.

Solution overview
We’re building our solution in three layers. Sadly, not all of it can be made with open source. The main layers are storage, infrastructure and software. Below is a high-level overview of the solution.

[Figure: A high-level overview of the OSA solution.]

Software is the most important layer. This includes the actual light archive software and all its components (like databases, application servers and subsystems). I will discuss this in detail later on. The software layer is installed on Linux servers, though you could most likely implement it on BSD or even on Windows-based servers too. These servers can be physical or virtual. We’ve chosen virtual CentOS and Ubuntu Linux distributions.

Infrastructure includes a hardware platform, physical or virtual. We built a virtual cluster with OpenNebula. There is more information on that in my previous posts. Briefly, the main idea is to have a scalable and reliable platform we can extend later based on the actual need. We can deploy our development and testing environment now and add more resources under the hood as we go. We can also relocate the virtual servers to a more powerful environment without changes to the payload.

If you’re interested in the tech stuff, here are a few details. We run CentOS 6.3 (minimal install) as the virtual hosts. It has good support for both hardware and software. It is also compatible with Red Hat if you require commercial support. Virtualization is based on KVM/QEMU, with OpenNebula for cluster and cloud features and Sunstone for management.

The storage technology is pretty much proprietary in our project, IBM and Oracle mostly. We need to use commercial software for the time being. One goal for the project is to find out if there is enough interest for making storage vendor independent and open source. Please let us know if you’re interested or have similar goals or a project.

Light archive architecture
As a base architecture, we use SOA (Service Oriented Architecture). It allows us to rapidly add and integrate applications into the system by using web services like REST and SOAP, message brokers, XML or the like. An additional benefit is that we don’t need to modify the original applications, so we maintain compatibility with the product. We are not tied to just one technology either. Though we have chosen Java as the main technology, nothing prevents us from adding, for instance, a Python-based service and using a standard REST interface to communicate with it.
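To illustrate how small such an add-on service can be, here is a minimal sketch of a Python service that exposes one REST-style endpoint using only the standard library. The `/status` path, the JSON payload and the names `StatusHandler` and `make_server` are assumptions made up for this example. They are not part of the OSA design.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class StatusHandler(BaseHTTPRequestHandler):
    """Answers GET /status with a small JSON document (hypothetical endpoint)."""

    def do_GET(self):
        if self.path == "/status":
            body = json.dumps({"service": "example", "status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404, "Not found")

    def log_message(self, format, *args):
        pass  # keep the example quiet; a real service would log requests


def make_server(port=0):
    """Bind to localhost; port 0 lets the operating system pick a free port."""
    return HTTPServer(("127.0.0.1", port), StatusHandler)


if __name__ == "__main__":
    server = make_server(8080)
    server.serve_forever()
```

A Java component would then talk to this service with an ordinary HTTP client, without knowing or caring that the other side is written in Python.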

Here is the design so far. It is very basic but will be documented and planned as we go. This is an agile project, so we do prototyping and incremental development rather than design and write documents.

[Figure: Light archive architecture.]

Fedora Commons is the core software. It is an excellent, mature open source project with hundreds of implementations and multiple projects built on top of it (e.g. Islandora). We chose Fedora because of its extensible nature and its content modeling capabilities. In addition, the majority of the required functionality is already implemented and well tested in it. Fedora provides REST and SOAP APIs to work with.

Data is stored in databases, in addition to OAIS packages. MariaDB is a MySQL replacement with a more open source aligned future. Its upcoming release has an integration with Apache Cassandra. NoSQL is an interesting technology, and we’re going to evaluate how it would work for storing archive records. The OAIS packages are going to be stored in the tape library. Previews and use copies of files are stored on a disk array. For this, we need an abstraction layer for storage.

Search and discovery features are provided by Solr, a well-known Lucene-based search engine. Its features include full-text search, faceting, clustering, real-time indexing, document handling, geospatial search and more. There is support for Solr in Fedora and in most databases, including Cassandra through the Solandra project.
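As a rough illustration of what a full-text search with faceting looks like from a client’s side, the sketch below builds a Solr select URL in Python. It uses Solr’s standard `q`, `wt`, `rows`, `facet` and `facet.field` parameters. The base URL, the core name `archive` and the field names are assumptions for this example only, and `solr_select_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, text, facet_fields=(), rows=10):
    """Build a Solr /select URL for a full-text query with optional facets."""
    params = [("q", text), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        params.append(("facet", "true"))
        # Solr accepts the facet.field parameter once per faceted field.
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical core and field names, for illustration only.
    print(solr_select_url("http://localhost:8983/solr", "archive",
                          "title:mikkeli", facet_fields=["creator", "year"]))
```

The response to such a request contains both the matching documents and the counts per facet value, which is what a user interface needs for "narrow your search" style navigation.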

Analytics and reporting are a business requirement for managing the archive system. The light archive can host multiple organizations’ repositories, so the archive service provider should be able to track the usage. Traditional big data tools could also be useful for adding value to the metadata. We’re also looking for ways to visualize the metadata, relationships and other information. Archives have huge potential for valuable and undiscovered information.

Media services, like streaming video and audio, are a must. We need to be able to let users experience the archived audiovisual data. It can be limited to identified and logged in users, paying customers, researchers or anything. But the data doesn’t provide much value if kept hidden in a dark archive. This is something that digitization companies could be interested in.

We will build an API on top of Fedora and any custom services we might develop. The API will streamline the access and add a security layer. It can also be used to publish public features online. Most likely it will be a REST API, but it’s not yet decided.

The workflow engine is another critical component. It will handle many system processes like ingest, migration, disposal and any batch processing. The micro-services pattern is a good solution for workflows, as seen in Archivematica and DAITSS.
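The idea behind the micro-services pattern can be sketched in a few lines of Python: each step does one small job on a package and hands it on, and a workflow is just an ordered list of steps. This is only an illustration under our own assumptions. The step names, the dictionary-based package and `run_workflow` are hypothetical and do not reflect how Archivematica or DAITSS implement their services.

```python
import hashlib


def compute_checksum(package):
    """Record a SHA-256 checksum of the payload for later fixity checks."""
    package["sha256"] = hashlib.sha256(package["payload"]).hexdigest()
    return package


def identify_format(package):
    """Deliberately naive format identification: PDF magic bytes only."""
    is_pdf = package["payload"].startswith(b"%PDF")
    package["format"] = "pdf" if is_pdf else "unknown"
    return package


def record_event(package):
    """Append an audit trail entry to the package."""
    package.setdefault("events", []).append("ingested")
    return package


# A workflow is an ordered list of small, independent steps.
INGEST_WORKFLOW = [compute_checksum, identify_format, record_event]


def run_workflow(package, steps):
    """Pass the package through each step in order and return the result."""
    for step in steps:
        package = step(package)
    return package
```

Because every step has the same shape, a migration or disposal workflow can reuse the same steps in a different order, and a single step can be replaced without touching the others.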

Finally, we need a client which consumes the web services. We will implement a web based client in this project, but it could be a desktop client or another repository as well. We have not yet decided what framework or project we’re going to use for building our client. Some Fedora projects use CMS software. Islandora for example uses Drupal as a front end. We could use our own in-house Yksa or some well-known Java platform like Liferay. We’re open to suggestions.

Next steps
Our goal is to have working installations of all the major software during the first couple of months of this year. We also need to identify and plan the integrations.

Fedora content and service modeling is also a critical task. We’re collecting test data to use for testing and modeling work. It can also be used for testing the performance of the interfaces we’re coding. This could be a topic for a future post. As people have told me, the power and weakness of Fedora is its framework-like nature. It is both possible and necessary to define what you want to do with it, whereas some systems dictate how they are to be used.

A web client is a must-have for agile development. Without a user interface, we can easily get lost developing functionality without being able to test-drive it. It is even more important that our partners and future users of the system can test it and tell us what they think as early as possible.

Mikko Lampi


Lessons Learned from OSA Platform

It has been quite a while since the last post. We have been busy testing and configuring the server platform and doing various groundwork for the project. But there has been good progress. We have a new developer starting soon and an operational server environment.

Last time I wrote that we got Ubuntu to boot from SAN. It was perhaps too early to celebrate. Ubuntu did boot, yes, but without the multipath setup and thus without failover functionality. Also, only one of the blades was able to write to the disk array, even though each of them recognized it and the mapped volumes.

Lessons learned
There was absolutely no clear information on how to set up multipath. We contacted IBM support, asked in the forums, asked around among people working with Linux, read various manuals and tried several distributions with better built-in support (e.g. RHEL, CentOS). Some sources said that there was no need for any additional configuration, while others said that device-specific settings had to be configured. We tried them all. In the end, the solution was simple. By default, the disk array controllers (which were quite old) used a deprecated driver (RDAC) which refused to work correctly. There was a firmware update for the controllers, but we couldn’t install it, because there were other blades running in the same environment which couldn’t be upgraded at that time. However, there was a small switch deep in the configuration software which selected the driver to be used with a specific volume. We changed it from the default (RDAC) to Linux (MPIO). We also changed the volume mapping from the default LUN 0 to LUN 1, because the default value is known to cause boot issues.

There was no need for manual multipath configuration. Any current major Linux distribution can recognize the SAN volumes, create the correct configuration on the fly and enable the multipath modules for the kernel. We tested with CentOS 6.3 64-bit and Ubuntu Server 12.04 LTS 64-bit. The CentOS and RHEL installer is more advanced: it provides a graphical menu which displays a multipath device with its device nodes (mapped volumes). The installation after this step was like any other, nice and easy. Ubuntu also discovered the mapped volumes, but as two drives. We tried partitioning the first one and installing on it. It seems that the issue occurs only at install time, and Ubuntu sees the changes made to the first volume on the second one as well. After the install was complete, the multipath module was loaded during boot and Ubuntu recognized the virtual multipath device instead of two attached volumes.

To get multipath configured and enabled automatically during the installation we needed to have both connections online and mapped for the volume. The first advice in the manual and blogs was to use only one. However, it is a great advantage not having to configure the setup by hand. Always try first with a complete SAN environment and only then try manually with a reduced configuration, if there is a need.

Benchmark results
I also promised some benchmark results. I used the hdparm utility to get an approximate overview. As a reference device, I had my high-end Dell laptop running a virtualized Ubuntu Desktop 12.10.

sudo hdparm -Tt /dev/sda

1) Diskless blades: 4 Gb Fibre Channel, IBM DS4500 disk array.
Timing cached reads: 19554 MB in 2.00 seconds = 9787.68 MB/sec
Timing buffered disk reads: 572 MB in 3.01 seconds = 190.13 MB/sec

2) Laptop: VMware Workstation 9, dedicated 7200rpm disk for virtual machines.
Timing cached reads: 16486 MB in 2.00 seconds = 8247.70 MB/sec
Timing buffered disk reads: 128 MB in 3.02 seconds = 42.36 MB/sec

You can imagine what the speed advantage would be with modern SAN hardware and 32 Gb connections, if this is what we get with years-old technology. For development needs, we have to make do with the old storage hardware at hand.
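To put a number on the difference, the figures can be compared directly. The short Python sketch below parses the two buffered-read lines quoted above and divides them. The regular expression and the `throughput` helper are ours, written for this illustration only, and assume the output format shown above.

```python
import re

# Matches the "= 190.13 MB/sec" part at the end of an hdparm timing line.
RATE = re.compile(r"=\s*([\d.]+)\s*MB/sec")


def throughput(line):
    """Extract the MB/sec figure from one hdparm timing line."""
    match = RATE.search(line)
    if match is None:
        raise ValueError(f"no throughput found in: {line!r}")
    return float(match.group(1))


blade = throughput("Timing buffered disk reads: 572 MB in 3.01 seconds = 190.13 MB/sec")
laptop = throughput("Timing buffered disk reads: 128 MB in 3.02 seconds = 42.36 MB/sec")
speedup = blade / laptop
print(f"Buffered reads on the blade are {speedup:.1f}x faster than on the laptop")
```

With the figures above, the blade comes out roughly 4.5 times faster on buffered disk reads, while the cached reads (9787.68 vs. 8247.70 MB/sec) differ by only about 1.2 times, since they mostly measure memory and CPU rather than the storage.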

I will continue later by writing about setting up our virtual environment and building a cloud platform for the archive software. We will also start the software development work at the beginning of 2013. Stay tuned.

Mikko Lampi
