Building a data collection system and data lake for Jewish museums around the world
Let’s talk about museums. We all know what they are, and we often seek them out when travelling or visiting new cities. Some museums are large, housing thousands of exhibits. Normally you would never think about how museums store their information, but it turns out to be a major problem shared by museums everywhere. After analysing the problem, we concluded that:
- Museums hold a lot of information.
- Museums hold very heterogeneous information – texts, pictures, audio, video, drawings, goods, artworks, artefacts, etc.
- Information can be highly interconnected. For example, the story of Van Gogh’s life will refer to his paintings, even though the paintings themselves belong to a separate display category.
- Museums want to exchange information with each other.
- There are government regulations (as in the EU) which require all museums to provide all their information in a standard format to a specific central data warehouse.
- Museums store their data in different formats (MARC 21, Dublin Core) and in different storage systems (Excel sheets, raw databases, CMSs).
There are very few major specialist aggregators in the world that can help museums solve these problems. One of them, we are proud to say, is our client, Jewish Heritage Network (JHN), which helps Jewish museums all over the world solve their content aggregation problems.
Together with JHN, we decided to build a centralized aggregation system. The goal of this system is to:
- Collect data from various museums. By creating multiple adapters we can fetch data in any format and from any storage in just a few minutes.
- Transform data into one EU-approved standard format called OAI. We have already integrated with it, so the process is problem-free for museums.
- Provide powerful dashboards for museums so they can see how much data has been collected, where errors have occurred and what they were, and check the quality of their datasets.
- Provide a REST API so that museums can consume each other’s data without implementing anything themselves.
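As an illustration of the kind of numbers such a dashboard could show, here is a minimal Python sketch that summarises one ingestion run. The record fields (`title`, `creator`, `date`), the `REQUIRED_FIELDS` tuple and the `summarise_run` helper are all hypothetical and are not taken from the actual system.

```python
from collections import Counter

# Hypothetical set of fields every record is expected to carry.
REQUIRED_FIELDS = ("title", "creator", "date")


def summarise_run(records):
    """Count collected records and tally which required fields are missing."""
    missing = Counter()
    complete = 0
    for record in records:
        # A field counts as missing when it is absent or blank.
        absent = [f for f in REQUIRED_FIELDS if not str(record.get(f) or "").strip()]
        if absent:
            missing.update(absent)
        else:
            complete += 1
    total = len(records)
    return {
        "collected": total,
        "complete": complete,
        "completeness": complete / total if total else 0.0,
        "missing_by_field": dict(missing),
    }
```

A summary like this could be stored alongside each run so that a museum can see at a glance which fields most often need attention.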
From a technical perspective, we built an automated system which acts as a processing engine and a storage layer at the same time. To achieve this, we first made a comprehensive analysis of existing headless CMSs. We selected Directus, an open-source and actively developed solution, which serves as our backbone. You can read more about it here: https://directus.io/ Then we needed to build an ETL engine, and we decided to go with Dagster. See the docs here: https://dagster.readthedocs.io/en/stable/
We implemented a few plug-and-play building blocks written in Python, which are easily integrated into Dagster. So when we onboard another museum, we choose one module to read data (say, from Excel), another to transform it (say, from MARC 21 to OAI), and then ingest the data into our Directus CMS.
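The building-block idea can be sketched roughly as follows. This is a dependency-free illustration, not the project’s actual code: the function names (`read_csv_rows`, `marc_like_to_common`, `run_pipeline`), the tag-to-field mapping, the sample row and the in-memory `sink` are all assumptions. In a Dagster deployment each step would typically be wrapped as an op inside a job, and the loader would write to Directus rather than append to a list.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]


def read_csv_rows(text: str) -> Iterator[Record]:
    """Reader block: yield one dict per CSV row (stand-in for an Excel reader)."""
    yield from csv.DictReader(io.StringIO(text))


# Illustrative mapping from MARC 21 tags to common field names.
MARC_TO_COMMON = {"245": "title", "100": "creator"}


def marc_like_to_common(record: Record) -> Record:
    """Transformer block: rename known MARC tags and drop everything else."""
    return {MARC_TO_COMMON[tag]: value for tag, value in record.items() if tag in MARC_TO_COMMON}


def run_pipeline(
    source: Iterable[Record],
    transform: Callable[[Record], Record],
    load: Callable[[Record], None],
) -> int:
    """Wire a reader, a transformer and a loader together; return the record count."""
    count = 0
    for record in source:
        load(transform(record))
        count += 1
    return count


# Usage with made-up sample data: swap any block to support another source or format.
sink: List[Record] = []
sample = "245,100,999\nSabbath lamp,Unknown maker,internal note\n"
processed = run_pipeline(read_csv_rows(sample), marc_like_to_common, sink.append)
```

Because the three blocks only agree on the shape of a record, a new reader or transformer can be added without touching the others.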
We should also say a bit more about the role of the CMS. We chose Directus because it is a headless CMS, which means it is fully functional without a UI.
It has a very rich and well-documented REST API, so you can easily implement your own UI or integrate seamlessly with any third-party system. We should add that the built-in UI is open source and has a lot of features, so if you need to extend it you can do so without problems.
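To give a feel for what consuming such an API involves, the sketch below builds (but does not send) a request for items in a Directus collection using only the Python standard library. The base URL, the `artworks` collection, the `title` field and the token are placeholders. The `/items/<collection>` path and the `filter[...][_contains]` query style follow the Directus REST conventions as we understand them, so check them against the documentation for your version.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_items_request(base_url, collection, token, title_contains=None, limit=10):
    """Build a GET request for items in a Directus collection."""
    params = {"limit": limit}
    if title_contains:
        # Filter on a hypothetical `title` field.
        params["filter[title][_contains]"] = title_contains
    url = f"{base_url.rstrip('/')}/items/{collection}?{urlencode(params)}"
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder values; nothing is sent over the network here.
request = build_items_request("https://cms.example.org", "artworks", "YOUR_TOKEN", "menorah")
```

Passing the resulting object to `urllib.request.urlopen` would perform the call against a real instance.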
Result and value
As a result, we’ve built a full-blown ETL engine that reliably collects data in various formats. It provides dashboards that museums can use to track what has been processed and to reprocess it if needed. It exposes a RESTful API for third-party consumers. For example, if a museum wants to exchange data with another museum, there is no need to create a custom integration, because all the data is in one format and in one place.
Moreover, while data is being read and collected, it is enriched through integration with services such as IIIF (https://iiif.io/).
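One plausible form of such enrichment is attaching a IIIF Image API URL to each record so that consumers can request the image at whatever size they need. The sketch below assembles a URL following the Image API pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The server address, the identifier and the `image_id` and `image_url` field names are made up, and which IIIF features the project actually uses is not stated here.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max", rotation="0",
                   quality="default", fmt="jpg"):
    """Assemble a IIIF Image API URL from its path components."""
    # The identifier is percent-encoded, including any slashes it contains.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


def enrich(record, base):
    """Return a copy of the record with a hypothetical `image_url` field added."""
    return {**record, "image_url": iiif_image_url(base, record["image_id"])}


# Made-up record and server, for illustration only.
enriched = enrich(
    {"title": "Torah crown", "image_id": "museum-a/0001"},
    "https://iiif.example.org/iiif/3",
)
```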
The hospitality industry has seen a big boost recently. People want to sleep well and eat well, and in the end everyone wants to be entertained. Museums in their current state cannot satisfy modern audiences: they need to be more interactive, evolve faster, and make their exhibitions more visitor-oriented. This became even more important during the COVID-19 lockdown. Your potential visitors are sitting at home and still want to see something interesting, and the only way to reach them is to digitize your content, share it easily, and exchange it with other content owners.