Thursday, November 5, 2015

Metadata: Continuing the Conversation

Shannon Albeke got into data mapping because of fish. “I worked at the Colorado Division of Wildlife for eight years prior to getting my Ph.D. It started back in 1999. We had lots and lots of information on these little fishes of Colorado. The only way you analyze all that data was to create a database and some tools for searching within it. It turned out I had an aptitude for it, and things just went from there. Now I’m an informaticist.” Albeke creates online archives that are open, accessible, and easy to navigate.
Shannon Albeke

In a way, Albeke is a victim of the internet’s success – by now, we’re so familiar with search engines like Google that we don’t think about the planning and management that make an algorithm such an indispensable tool. Albeke isn’t just a strong advocate for data sharing – he and his team also promote good metadata habits. “Metadata” refers to data about data, like labels on files or tags on blog posts. Like the listings in a phone book, metadata allows a search engine to reach individual datasets in a gigantic archive. When metadata is sloppy or incomplete, data is effectively “unlisted.”

What are the benefits of the “open-data” method? (And why should researchers cultivate good data habits?) Data sharing cuts down on redundancy, that is, experiments that needlessly repeat earlier work. This means that scientists waste less time and use resources more effectively. It can be hard for scientists to raise funds for large-scale studies, and using available data can make research much more efficient.

This is especially important for student researchers. As Albeke explains, “One student wants to use software to process gut microbes and use their DNA to explore the fauna living in your belly. Before, she could read an article about gut fauna. But now, she can also look at the data those researchers used. She can use the same tools to ask a different question of the same dataset. Could she have done that ten years ago? Absolutely not.”

Data sharing also allows researchers working in the same study area to answer broader, more complex questions by working across disciplines. For example, an ecologist collecting data on snowfall could partner with an entomologist examining bark beetle populations in the same forest. By sharing information, these researchers could better answer questions about how precipitation and weather affect beetle outbreaks.

Albeke’s team is planning visual maps of a study area, with data sets linked to a particular location. For example, several sets of data could be grouped together as part of a “clickable” multimedia map of the Snowy Range in the Medicine Bow National Forest. Researchers could look at a geographical map and see data on any number of measurements including water flow in a stream, plant growth in the forest, or weather records like temperature or wind speed. By creating a system that allows for different sets of data to be viewed on a map, researchers can answer questions on many different levels.

One very basic example of a database map.
Image credit: Shannon Albeke
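
For readers curious how a “clickable” map might find the data tied to a spot, here is a minimal sketch in Python. It is not Albeke’s system: the dataset names, the coordinates, and the `datasets_near` helper are all made up for illustration, and it simply assumes each dataset is stored with a latitude and longitude.

```python
import math

# Hypothetical datasets with made-up coordinates, for illustration only.
DATASETS = [
    {"name": "stream flow", "lat": 41.35, "lon": -106.30},
    {"name": "air temperature", "lat": 41.36, "lon": -106.31},
    {"name": "plant growth plots", "lat": 41.90, "lon": -105.50},
]

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    r = 6371.0  # mean Earth radius in kilometers
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def datasets_near(lat, lon, radius_km=5.0, datasets=DATASETS):
    """Return names of datasets recorded within radius_km of a clicked point."""
    return [
        d["name"]
        for d in datasets
        if distance_km(lat, lon, d["lat"], d["lon"]) <= radius_km
    ]

# A "click" at the first site returns the two nearby datasets.
print(datasets_near(41.35, -106.30))
```

A real archive would more likely rely on a spatial database than a loop like this, but the idea is the same: location becomes one more piece of metadata to search on.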

In addition, Albeke is creating data banks that thousands of people can use – as researchers and contributors. According to him, the biggest problem is “searchability,” or making data legible and visible, especially across disciplines. Different fields of research use different words, even in closely related areas like botany and biology. This means that a scientist who searches for a word related to their research might not see useful data if it’s been collected and stored by a scientist using a different set of terms. Albeke’s solution is to create search engines that can “translate” terms across disciplines. This is called “semantic searching.”
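
To make the idea of “translating” terms concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the search engine Albeke’s team is building: the vocabulary in `CONCEPTS`, the sample datasets, and the `semantic_search` function are all hypothetical.

```python
# Hypothetical vocabulary: maps discipline-specific terms to a shared concept.
# These entries are invented for illustration only.
CONCEPTS = {
    "streamflow": "water flow",
    "discharge": "water flow",
    "runoff": "water flow",
    "snowpack": "snow",
    "snow water equivalent": "snow",
}

# Hypothetical datasets, each tagged with keywords chosen by its contributor.
DATASETS = [
    {"title": "Stream gauge readings", "keywords": ["discharge"]},
    {"title": "Hillslope hydrology survey", "keywords": ["runoff", "snowpack"]},
    {"title": "Bark beetle counts", "keywords": ["beetles"]},
]

def to_concept(term):
    """Return the shared concept for a term, or the term itself if unmapped."""
    term = term.strip().lower()
    return CONCEPTS.get(term, term)

def semantic_search(query, datasets=DATASETS):
    """Return titles of datasets whose keywords share a concept with the query."""
    wanted = to_concept(query)
    return [
        d["title"]
        for d in datasets
        if any(to_concept(k) == wanted for k in d["keywords"])
    ]

# Neither dataset was tagged "streamflow", yet both are found.
print(semantic_search("streamflow"))
```

A plain keyword match on “streamflow” would return nothing here, because the contributors used “discharge” and “runoff” instead. Mapping every term to a shared concept first is what lets the search cross that vocabulary gap.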

Aside from data availability, another big problem is security. Does “sharing” data mean that it’s available to everyone? Can multiple people “edit” the data, like a Wikipedia entry or group Facebook page? If the data is available on a website linked to email and password information, what if the site is hacked? What about plagiarism? What if someone deletes four years of data by mistake?

All of these questions need to be answered before data sharing can become the norm, and Albeke’s team partners with IT professionals to find ways to maximize security and flexibility.

Some of these solutions can actually add features to the program. For example, a data archive could allow scientists to track “visitors” to their data and find researchers with similar interests. In this way, people who use the research would be identified, much as a library records who checks out a book. Tracking could also allow users to network with readers and colleagues around the world, turning a data archive into a forum where scientists can synthesize results and collaborate on questions.

As Albeke and others find ways to manage data, researchers will need to help make data available and provide the additional information that lets others understand it. In WyCEHG, researchers are already making data available and working with Albeke and his team so that scientific questions consider the big picture and draw on all available resources to answer complex questions about water, to the benefit of Wyoming and our water managers.

Posted by Jess White on November 5, 2015
