Data Discovery for Earth Science

Websites like the NASA Earth Observatory showcase the many uses of satellite data to highlight interesting natural events. International partner instruments on NASA satellites such as Japan’s ASTER instrument and Canada’s MOPITT instrument, both onboard the Terra satellite, are also included as part of the Observatory. This challenge will ask you to devise a tool or technique to guide users to relevant datasets to study specific events.

DREO (Datasets Recommender for Earth Observatory)

Summary

We have developed a Chrome extension that reads articles on the Earth Observatory website and recommends relevant datasets (collections) from the Earthdata Search site.

How We Addressed This Challenge

LINK TO THE PRESENTATION


The Earth Observatory website contains many articles about natural phenomena, and its materials are used by both professionals and amateurs. Unfortunately, readers at every level often struggle to find the data behind the work they have just read. A reader who becomes interested in doing similar research after reading an article has to spend a lot of time and effort tracking that data down. Although many data repositories are available online, it is not always easy for users to learn how to extract the information they need from them.


When it comes to satellite data, one of the most popular sites for finding information is search.earthdata.nasa.gov. It hosts thousands of collections (datasets) of files in many formats and types, from Excel spreadsheets with weather measurements to satellite photographs. The total number of files across the collections is estimated to be in the millions, so even an experienced user can find it difficult to locate the right data.


That is why we have developed a system that will perform searches for the user.

When you open a page with an article, the algorithm extracts the relevant information from it and converts it into a search query for the dataset catalog. It then takes the 10 most relevant collections and shows them to the user at the top of the page, under the article title.
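The query-building step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `build_collection_query` and the sample keywords are hypothetical, while `keyword`, `platform`, `instrument` and `page_size` are parameters of the public CMR collection search.

```python
from urllib.parse import urlencode

# Public CMR collection search endpoint (JSON flavour).
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_collection_query(keywords, platform=None, instrument=None, limit=10):
    """Turn facts extracted from an article into a CMR collection search URL.

    Hypothetical helper: the project does not document its own query format.
    """
    params = [("keyword", " ".join(keywords)), ("page_size", str(limit))]
    if platform:
        params.append(("platform", platform))
    if instrument:
        params.append(("instrument", instrument))
    return CMR_COLLECTIONS_URL + "?" + urlencode(params)


# Made-up example: an article about wildfire smoke seen by MOPITT on Terra.
url = build_collection_query(["wildfire", "smoke"], platform="Terra", instrument="MOPITT")
print(url)
```

Setting `page_size` to 10 mirrors the ten collections shown to the user; how the project actually ranks or trims results is not stated in this write-up.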


Any research starts with collecting data, and getting through this stage quickly can motivate people and save them time. Why is our work important? Many people abandon a new venture very quickly when they run into difficulties. Our program lets the researcher select data quickly and accurately and start analyzing it immediately, without being held up by the long, hard work of gathering information.

How We Developed This Project

Our team is very interested in data analysis and information retrieval, so we chose a challenge that gave us the opportunity to work in both areas.


The Chrome extension is written in JavaScript, and the data analysis is carried out on a server. We have deployed the server application to Heroku.
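To make the split between extension and server concrete, here is a minimal sketch of what a server endpoint could look like, using only the Python standard library. Everything in it is an assumption for illustration: the `url` query parameter, the JSON shape and the `recommend` placeholder are hypothetical, and the write-up does not say which web framework the deployed application uses.

```python
import json
from urllib.parse import parse_qs


def recommend(article_url):
    """Placeholder for the server-side analysis (scraping, NER, CMR search)."""
    return []


def app(environ, start_response):
    """WSGI app: GET /?url=<article URL> returns recommended collections as JSON."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    article_url = query.get("url", [""])[0]
    if not article_url:
        status = "400 Bad Request"
        payload = {"error": "missing 'url' parameter"}
    else:
        status = "200 OK"
        payload = {"article": article_url, "collections": recommend(article_url)}
    body = json.dumps(payload).encode("utf-8")
    headers = [("Content-Type", "application/json"),
               ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]
```

For a local try-out, `app` can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`; the extension would then request this endpoint with the URL of the article being viewed.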


To get data from the Earth Observatory site, we used web scraping and wrote a parser in Python.
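As an illustration of this kind of parser, the sketch below pulls the page title and paragraph text out of an HTML document with the standard library's `html.parser`. It is not the project's parser: the class name `ArticleParser`, the choice of `<title>` and `<p>` elements, and the sample HTML are all assumptions, and the real Earth Observatory markup may need more specific selectors.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            # Join the text chunks and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buffer.append(data)


def parse_article(html):
    """Return (title, list of paragraph strings) for an HTML page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.paragraphs


# Made-up sample page, not real Earth Observatory markup.
sample = """<html><head><title>Fires in Siberia</title></head>
<body><p>The <a href="#">MODIS</a> instrument on Terra
captured smoke.</p><p>  </p></body></html>"""
title, paragraphs = parse_article(sample)
print(title)
print(paragraphs)
```

Text inside nested inline tags such as links is kept, and empty paragraphs are dropped, so the output is ready to be passed to a text-analysis step.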

To get data from the text of the articles, we used the spaCy library and its pretrained models for named-entity recognition. From the text, we extracted the names of platforms and instruments (the satellites and the devices that collect the data), locations and time periods of the research, as well as the main keywords of the article.
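A minimal sketch of the entity-extraction step is shown below. It assumes the `en_core_web_sm` model, whose label set includes `GPE`/`LOC` for places and `DATE`/`TIME` for times; the helper names and the mapping to query fields are hypothetical. It covers only locations and dates, because this write-up does not say how platform and instrument names were recognized.

```python
from collections import defaultdict

# Assumed mapping from spaCy entity labels to the query fields we care about.
LABEL_TO_FIELD = {"GPE": "locations", "LOC": "locations",
                  "DATE": "dates", "TIME": "dates"}


def group_entities(labelled_spans):
    """Group (text, label) pairs into query fields, dropping duplicates in order."""
    fields = defaultdict(list)
    for text, label in labelled_spans:
        field = LABEL_TO_FIELD.get(label)
        if field and text not in fields[field]:
            fields[field].append(text)
    return dict(fields)


def extract_fields(article_text, model="en_core_web_sm"):
    """Run spaCy NER over the text; requires spaCy and the model to be installed."""
    import spacy  # imported lazily so group_entities works without spaCy

    nlp = spacy.load(model)
    doc = nlp(article_text)
    return group_entities((ent.text, ent.label_) for ent in doc.ents)


# Hand-written pairs standing in for doc.ents, so no model is needed here.
example = group_entities([("Siberia", "GPE"), ("July 2020", "DATE"),
                          ("Siberia", "GPE"), ("NASA", "ORG")])
print(example)
```

Keeping the grouping logic separate from the spaCy call makes it easy to test the mapping on its own and to swap in a different model later.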

Once the article has been parsed, we search for relevant datasets on search.earthdata.nasa.gov. All of the site's collections are cataloged in the Common Metadata Repository (CMR).


These metadata records are registered, modified, discovered, and accessed through programmatic interfaces leveraging standard protocols and APIs.
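As a hedged example of accessing these records programmatically, the sketch below reads collection identifiers and titles from a CMR JSON search response. It assumes the `feed`/`entry` layout that CMR returns for `.json` collection searches; the function names and the sample record are made up for illustration.

```python
import json
from urllib.request import urlopen


def top_collections(cmr_response, limit=10):
    """Pick (concept id, title) pairs from a decoded CMR collections.json response."""
    entries = cmr_response.get("feed", {}).get("entry", [])
    return [(entry.get("id"), entry.get("title")) for entry in entries[:limit]]


def fetch_collections(url, timeout=30):
    """Download and decode a CMR search URL (network access required)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Trimmed, made-up response used only to show the expected shape.
sample_response = {"feed": {"entry": [
    {"id": "C0000000000-EXAMPLE", "title": "Example collection"},
]}}
print(top_collections(sample_response))
```

Using `.get` with defaults means an empty or unexpected response yields an empty list instead of raising an error.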

How We Used Space Agency Data in This Project

All pages on the Earth Observatory site are divided into three types:






Each type carries its own set of data: for example, images have an exact geolocation, and articles list the sources the author referred to when writing them. As explained in the previous section, we scraped this data from the page and then processed it with NLP algorithms.


Then we searched for suitable collections on search.earthdata.nasa.gov using the CMR system.
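One way an image's geolocation and date could narrow such a search is through CMR's `bounding_box` (west, south, east, north) and `temporal` (start, end) parameters. The sketch below only formats those two values; the padding sizes, the helper names and the sample coordinates are assumptions, and the write-up does not confirm that the project used these parameters.

```python
from datetime import date, timedelta


def bounding_box(lat, lon, pad_deg=1.0):
    """Format a CMR bounding_box string: west,south,east,north.

    Values are clamped to valid ranges; boxes crossing the antimeridian
    are not handled in this simplified sketch.
    """
    west, east = max(lon - pad_deg, -180.0), min(lon + pad_deg, 180.0)
    south, north = max(lat - pad_deg, -90.0), min(lat + pad_deg, 90.0)
    return f"{west:g},{south:g},{east:g},{north:g}"


def temporal_range(day, pad_days=7):
    """Format a CMR temporal string: start,end in ISO 8601 with a UTC 'Z' suffix."""
    start = day - timedelta(days=pad_days)
    end = day + timedelta(days=pad_days)
    return f"{start.isoformat()}T00:00:00Z,{end.isoformat()}T23:59:59Z"


# Made-up location and date for illustration.
box = bounding_box(61.5, 105.0)
period = temporal_range(date(2020, 7, 15))
print(box)
print(period)
```

Padding the point and the date gives the search some tolerance, since a collection rarely matches a single coordinate or a single day exactly.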

Project Demo

LINK TO THE PRESENTATION


Example of the algorithm at work:

Figure 1 shows one of the Earth Observatory articles loaded without our extension turned on.


(Figure 1) Earth Observatory article view without extension



In Figure 2, we have loaded the same article again, this time with the extension enabled.

As you can see, links to the recommended collections appear under the title of the article.


(Figure 2) Earth Observatory article view with extension



We also provide a Google Drive link with a demonstration of the extension in action:

Link to Google Drive


Code on GitHub

Data & Resources
  1. NASA Earth Observatory: https://earthobservatory.nasa.gov/
  2. NASA Earthdata search: https://search.earthdata.nasa.gov/search
  3. The Common Metadata Repository: https://cmr.earthdata.nasa.gov/search
  4. spaCy library for Python: https://spacy.io/
Tags
#datasets, #nasa, #satellites, #research, #information retrieval, #nlp, #earth
Judging
This project was submitted for consideration during the Space Apps Judging process.