Buy it Crash it - Space Apps Challenge

Buy it Crash it | Data Discovery for Earth Science

Team Updates

Since the official website has locked submission of the "Project" page, our team emailed [matt@spaceappschallenge.org] and [web@spaceappschallenge.org] as evidence that we met the deadline of October 12th, 23:59 (local time), and pasted our full content on the team board. We hope the judges will see these attachments.

Project Demo

Demo site: http://great.edo.tools

Video: https://www.youtube.com/watch?v=WXzdypQdGqM

Project Code

https://github.com/TsungTang/Earth-Dataset-Odyssey

https://github.com/bonzoyang/buyitcrashit

Yang, Yu Chun

Project Title*

EDO (Earth Dataset Odyssey), a handy yet powerful gateway for dataset recommendation.

Provide a high-level summary of your project*

Summary

Faced with the large NASA Earthdata archives, we devised a lightweight yet powerful web-based recommender, ‘Earth Dataset Odyssey’ (EDO), with API support. EDO first shows articles on recent events so that users can quickly grasp the outlines and the related datasets. It then makes recommendations based on the user’s browsing history. EDO also accepts specific paragraphs or series of keywords as search input. We use NLP to project datasets into the GloVe vector space so that similarity between datasets and paragraphs can be established. We also incorporate the user's browse times into the temporal filters established through NLP pattern recognition of the searched contents.

Describe how your project addresses this challenge*

How We Addressed This Challenge

What did we develop?

An efficient and lightweight recommender system for Earth datasets, powered by an NLP engine, that provides a full-stack solution incorporating an automatic temporal filter and citation information for the datasets.

Why is it important?

There are more than 7,000 datasets on the official NASA Earthdata Search website. Although trained experts can quickly and accurately find the datasets they need, it remains quite challenging for other researchers, including students and citizen scientists, to identify the datasets relevant to their research interests.

What does it do?

The characteristics of the EDO are summarized below.

1. Present recent blogs from Earth Observatory.

EDO automatically lists articles on recent events from the NASA Earth Observatory website, giving users a quick overview of current concerns.

2. Recommend datasets based on the user's browsing history of articles.

EDO recommends datasets according to the articles the user has browsed.

3. Recommend datasets for targeted paragraphs.

The search engine supports searching with user-supplied paragraphs of up to 5,000 words.

4. Automatic time phrase retrieval for temporal filters.

When the user enters paragraphs or an article, EDO automatically retrieves the time phrases and uses them as the condition for a temporal filter, so that it can efficiently recommend relevant datasets.

5. Account for user feedback.

EDO integrates the user feedback into the search mechanism by allowing the users to click for reviews.

6. Provide metadata of datasets from the CMR API.

The recommended results include the metadata of each dataset from the CMR API.

7. Recommend datasets based on the user's browsing history of datasets.

EDO recommends datasets according to the datasets the user has browsed.

8. Provide citation information of the datasets.

For each dataset, EDO provides a list of academic publications that cite the dataset. This provides the users with useful hints for their potential solutions.

9. Open API for paragraph matching is available.

We provide an open API so that the users can easily integrate our engine into their applications and even customize it to meet their needs.

How does it work?

EDO features a search engine and a recommender engine, both powered by NLP technology. NLP helps EDO build a better representation of each dataset, which the recommender engine then uses. NLP also retrieves time phrases, which allows EDO to search for datasets that are temporally related to the keywords entered by the user. For more technical details, please see the sections below.

What do we hope to achieve?

An easy-to-use and lightweight data search engine that enables researchers to get started in the field more easily and accurately when solving problems in Earth sciences.

Describe how you developed your project*

How We Developed This Project

What inspired your team to choose this challenge?

As researchers, we find it difficult to get started quickly with NASA Earth datasets because of the sheer number of datasets and the unfriendly user interface. A well-designed recommendation system could therefore be a big relief to all researchers in the field.

What was your approach to developing this project?

The approach of this project can be broken down into the following three parts.

First, we collect critical information about the datasets for our system. Dataset metadata, including title, summary, temporal extent, paper citations, and so on, are obtained via the NASA CMR API. The metadata are essential not only to the front-end display but also to the NLP model in the EDO recommender engine. In addition, we collect several articles from NASA Earth Observatory.
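
The metadata collection step can be sketched roughly as follows. This is an illustrative sketch, not EDO's actual collector: it assumes the public CMR collection search endpoint with its `keyword`, `page_size`, and `temporal` parameters, and it assumes the JSON response lists collections under `feed.entry` with `title`, `summary`, `time_start`, and `time_end` fields. The function names `build_cmr_query` and `extract_metadata` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for CMR collection search (JSON output).
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_cmr_query(keyword, page_size=10, temporal=None):
    """Build a collection-search URL for a free-text keyword.

    temporal, if given, is a (start, end) pair of ISO 8601 strings.
    """
    params = {"keyword": keyword, "page_size": page_size}
    if temporal is not None:
        start, end = temporal
        params["temporal"] = f"{start},{end}"
    return f"{CMR_COLLECTIONS_URL}?{urlencode(params)}"


def extract_metadata(response):
    """Keep only the fields a recommender needs from a parsed JSON response.

    The feed/entry layout is an assumption about the response shape.
    """
    entries = response.get("feed", {}).get("entry", [])
    return [
        {
            "id": entry.get("id"),
            "title": entry.get("title", ""),
            "summary": entry.get("summary", ""),
            "time_start": entry.get("time_start"),
            "time_end": entry.get("time_end"),
        }
        for entry in entries
    ]
```

The URL from `build_cmr_query` would be fetched with any HTTP client, and the decoded JSON passed to `extract_metadata`; missing fields come back as `None` or an empty string rather than raising.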

Second, an NLP model is built for EDO. We represent each dataset with a feature vector, and these feature vectors are the foundation of the EDO recommender engine. Built on top of this, the EDO search engine can search for datasets related to given keywords and a temporal constraint. Furthermore, a date parser complements the NLP search engine by automatically retrieving time phrases from the keywords entered by the user.

Finally, we build a web-based service featuring a user-friendly front end and a polished UI. Our website works on both personal computers and mobile devices, which makes it much more efficient and easier for users to go through NASA Earth datasets.

What tools, coding languages, hardware, software did you use to develop your project?

The essential tools for building our web service include Node.js, JavaScript, and Docker. TF-IDF and word embedding algorithms are used in our NLP recommendation engines.

Our website has a separate back end and front end. The back end provides API services, while the front end integrates data and content presentation. We use Python Django as the back-end framework along with a PostgreSQL database, and Vue.js for the front-end SPA (single-page application). To customize the EDO interface, our UI/UX designer provided web design drafts to the front-end engineer, who developed the UI components with Tailwind CSS and SCSS.

EDO collects the metadata of Earth science datasets through the API provided by the NASA Common Metadata Repository (CMR). Titles, summaries, descriptions, and other text fields are combined to create the TF-IDF feature vector of each dataset [1]. By comparing the similarity of feature vectors, EDO finds associations between datasets, or between datasets and recent disaster articles. Once the user has browsed or saved some datasets or articles, EDO uses a last-view algorithm that weights each browsing action according to when it occurred. In the end, EDO recommends the datasets considered most helpful to the user.
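
As a rough illustration of this pipeline, the sketch below builds sparse TF-IDF vectors, compares them with cosine similarity, and ranks unseen datasets against a recency-weighted profile of what the user viewed. It is a minimal sketch under stated assumptions: the whitespace tokenizer, the smoothed IDF formula, and the exponential `decay` weighting are illustrative choices, not the weighting EDO actually uses for its last-view algorithm.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, keeping alphabetic tokens only."""
    return [word for word in text.lower().split() if word.isalpha()]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(vectors, viewed, decay=0.5, top_k=3):
    """Rank unseen datasets against a recency-weighted user profile.

    viewed holds dataset indices ordered from oldest to newest; the newest
    view gets weight 1 and each older view is multiplied by decay (assumed).
    """
    profile = Counter()
    for age, idx in enumerate(reversed(viewed)):
        for term, weight in vectors[idx].items():
            profile[term] += (decay ** age) * weight
    seen = set(viewed)
    scored = [(cosine(profile, vec), i) for i, vec in enumerate(vectors) if i not in seen]
    return [i for _, i in sorted(scored, reverse=True)[:top_k]]
```

With a small `decay`, the most recent view dominates the profile, so the ranking follows the user's latest interest; a `decay` close to 1 treats all past views almost equally.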

In addition, EDO converts the natural-language description of each dataset into word vectors and weights them with TF-IDF. A dataset can therefore be represented in the word vector space of the GloVe word vectors published by Stanford University [3]. Once the user gives an arbitrary sentence or paragraph to the EDO search engine, EDO can match the closest datasets in the vector space and recommend them to the user.
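
The idea of placing datasets and queries in the same word vector space can be sketched as below. The tiny three-dimensional `EMBEDDINGS` table is made up purely for illustration and stands in for pretrained GloVe vectors, and the function names are hypothetical. Each text is embedded as the IDF-weighted average of its word vectors, with term frequency entering through repeated words, and the query is matched to the dataset with the highest cosine similarity.

```python
import math

# Toy 3-dimensional embeddings standing in for pretrained GloVe vectors.
# The values are made up for illustration only.
EMBEDDINGS = {
    "ice": [0.9, 0.1, 0.0],
    "snow": [0.8, 0.2, 0.0],
    "glacier": [0.85, 0.15, 0.05],
    "fire": [0.0, 0.9, 0.1],
    "smoke": [0.1, 0.8, 0.2],
    "burn": [0.05, 0.85, 0.1],
}
DIM = 3


def compute_idf(texts):
    """Smoothed inverse document frequency for every word in the corpus."""
    n = len(texts)
    df = {}
    for text in texts:
        for word in set(text.lower().split()):
            df[word] = df.get(word, 0) + 1
    return {word: math.log((1 + n) / (1 + count)) + 1 for word, count in df.items()}


def embed(text, idf):
    """IDF-weighted average of word vectors; words without a vector are skipped."""
    total = [0.0] * DIM
    weight_sum = 0.0
    for word in text.lower().split():
        if word in EMBEDDINGS:
            weight = idf.get(word, 1.0)  # words unseen in the corpus get weight 1
            total = [t + weight * x for t, x in zip(total, EMBEDDINGS[word])]
            weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_dataset(query, datasets):
    """Return the name of the dataset whose description is nearest to the query."""
    idf = compute_idf(list(datasets.values()))
    query_vec = embed(query, idf)
    return max(datasets, key=lambda name: cosine(query_vec, embed(datasets[name], idf)))
```

Because the query and the dataset descriptions share one vector space, a query can match a dataset even when they have no words in common, as long as their words have nearby vectors.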

EDO provides a full-stack solution for issues of interest. With EDO, the full picture of an issue, the datasets, and the solutions are in the palm of your hand after only a few clicks. In addition to dataset recommendations, EDO lists published papers that cite the datasets. Through the citation graph, these related papers point to potential solutions to the issue of interest.

EDO makes it much easier to search for datasets by date. Users can customize the date range, and EDO also automatically extracts time-related phrases from the keywords they enter. With NLP, temporal information can be extracted even when an entire paragraph is given as the search input. The Python dateparser library is used to extract temporal information from the search keywords. Based on the extracted temporal information, EDO returns the datasets that best meet the user’s needs.
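
The temporal filter can be illustrated with the simplified sketch below. To stay dependency-free it swaps in a plain regular expression that only recognizes four-digit years, as a stand-in for the dateparser library named above, which handles far richer time phrases. The function names and the overlap rule (a dataset is kept when its temporal extent intersects the extracted window, with `None` meaning an ongoing dataset) are assumptions for illustration.

```python
import re
from datetime import date

# Simplified stand-in for dateparser: matches four-digit years 1900-2099 only.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_year_range(text):
    """Return (start, end) dates spanning every year mentioned, or None."""
    years = [int(match.group(0)) for match in YEAR_PATTERN.finditer(text)]
    if not years:
        return None
    return date(min(years), 1, 1), date(max(years), 12, 31)


def overlaps(extent, window):
    """True if a dataset's temporal extent intersects the query window.

    An extent end of None is treated as an ongoing dataset.
    """
    start, end = extent
    window_start, window_end = window
    return start <= window_end and (end is None or end >= window_start)


def filter_by_time(datasets, text):
    """Keep dataset names whose extent overlaps the years found in the text.

    datasets maps a name to a (start, end) pair; if the text contains no
    year, every dataset is kept.
    """
    window = extract_year_range(text)
    if window is None:
        return list(datasets)
    return [name for name, extent in datasets.items() if overlaps(extent, window)]
```

Falling back to the full list when no year is found keeps the filter from hiding results for queries that carry no temporal information.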

What problems and achievements did your team have?

EDO combines intelligent algorithms with a user-friendly interface.

While EDO offers a polished UI for users, an open API is also available for experts who are familiar with programming.

There are several vital algorithms in EDO. First, EDO searches for datasets related to the entered keywords or paragraphs, with advanced settings for specifying the temporal extent. Second, NLP pattern recognition is used to retrieve time phrases and filter datasets by temporal extent. Third, EDO recommends similar datasets according to the user's browsing history. Last but not least, EDO is a full-stack recommendation system that brings potential solutions to your topic of interest by showcasing topic highlights, providing topic-related datasets, and displaying academic citations for each dataset.

To sum up, EDO provides a complete, portable, and easy-to-use solution that allows users to quickly explore their research interests and identify the appropriate data for related issues among the large archives of datasets.

How did you use space agency data in your project?*

How We Used Space Agency Data in This Project

Using the NASA CMR (Common Metadata Repository) API, the metadata of Earth datasets can be easily retrieved. Dataset metadata, including title, summary, temporal extent, paper citations, and so on, are essential to both the front-end display and the NLP model in the EDO recommender engine. The Earth datasets are owned or managed by different space agencies. With this diversity, EDO can reach a wider range of public issues on Earth.

Demonstrate your solution*

Project Demo

Demo site: http://great.edo.tools

Video: https://www.youtube.com/watch?v=WXzdypQdGqM

Share your code (if applicable)

Project Code

https://github.com/TsungTang/Earth-Dataset-Odyssey

https://github.com/bonzoyang/buyitcrashit

References: List the data and resources used in your project*

Data & Resources

[1] Juan Ramos. (2003). Using TF-IDF to Determine Word Relevance in Document Queries.

[2] Suzanne L. LeMole, Steven Howard Nurenberg, Joseph Thomas O'Neil, Peter H. Stuntebeck. (1995). Method and system for presenting customized advertising to a user on the world wide web. US Patent, 1999.

[3] J. Pennington, R. Socher, C. D. Manning. (2014). GloVe: Global Vectors for Word Representation.

[4] Chun-Tan Cheng. (2012). Exploring Why Facebook Users Press the "Like" Button.

[5] Derek J. de Solla Price. (July 30, 1965). Networks of Scientific Papers.

[6] Francesco Ricci, Lior Rokach, Bracha Shapira. (2011). Introduction to Recommender Systems Handbook.

[7] NASA Earthdata CMR Search, https://cmr.earthdata.nasa.gov/search

[8] NASA Earth Observatory, https://earthobservatory.nasa.gov/

[9] NASA Earthdata Search, https://search.earthdata.nasa.gov/search

Add some tags so we can categorize your project

#Search Engine, #Recommender System, #Earth Data, #Earth Science, #Machine Learning, #Natural Language Processing
