Over the past few decades, the availability of cheap computers, nanosatellites and innovative telescopes has produced a treasure trove of data in astronomy and the earth sciences. Yet this vast amount of data is often not well structured or standardized, which makes relevant datasets hard to discover and explore, especially for amateur scientists and enthusiasts, and creates an entry barrier for anyone interested in scientific data. Our solution addresses this problem by making it easier to discover and explore relevant data from a single access point.
We have developed a recommender system that takes a conversational approach to discovering data in a huge data space. Instead of relying on a tag-based or keyword-based search, the user can ask a non-technical natural-language question such as “I want data about Mars missions and climate” or “How did the universe begin? What are the earliest stars?”. The Vhere recommender parses these queries and matches them against pre-processed dataset descriptions and metadata from various sources, which are often highly technical and hard to read. This approach helps experienced users find exact datasets, and it is just as useful when the user only has a vague idea of the data they need and wants to explore the solution space.
With this solution we hope to make scientific data more accessible to non-scientists, amateurs, students and citizen scientists.
We are a team of professional computer scientists with a strong interest in astronomy and the earth sciences. While working on another project involving meteorite-landing data accessed through the NASA portal, we faced the challenge described above ourselves, which inspired us to apply our expertise to it.
The Vhere recommender pre-processes all available dataset descriptions and metadata with the help of Natural Language Processing (NLP) and builds a machine learning model that is stored on disk. User queries are then matched against this pre-computed model to find the best matches, and the user is presented with the most relevant datasets.
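As a rough illustration of this pipeline, the sketch below builds a plain TF-IDF bag-of-words index over the dataset descriptions, pickles it to disk, and ranks datasets against a query by cosine similarity. The function names and the simple whitespace tokenizer are illustrative stand-ins rather than the exact Vhere implementation; the NLTK-based preprocessing we actually rely on is sketched alongside the tools description further down.

```python
import math
import pickle
from collections import Counter

def tokenize(text):
    # Illustrative stand-in; see the NLTK preprocessing sketch further down.
    return text.lower().split()

def build_model(documents, model_path="vhere_model.pkl"):
    """Turn dataset descriptions into TF-IDF vectors and store them on disk."""
    tokenized = [tokenize(doc) for doc in documents]
    df = Counter()                      # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    n_docs = len(tokenized)
    idf = {term: math.log(n_docs / freq) for term, freq in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * idf[term] for term, count in tf.items()})
    with open(model_path, "wb") as f:
        pickle.dump({"idf": idf, "vectors": vectors}, f)

def cosine(a, b):
    # Cosine similarity between two sparse term-weight dictionaries.
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    denom = (math.sqrt(sum(v * v for v in a.values()))
             * math.sqrt(sum(v * v for v in b.values())))
    return num / denom if denom else 0.0

def recommend(query, model_path="vhere_model.pkl", top_k=5):
    """Match a natural-language query against the pre-computed dataset vectors."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    tf = Counter(tokenize(query))
    query_vec = {t: c * model["idf"].get(t, 0.0) for t, c in tf.items()}
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(model["vectors"])]
    return sorted(scores, reverse=True)[:top_k]
```

Calling `recommend("I want data about Mars missions and climate")` would return the indices of the best-scoring dataset descriptions, which are then mapped back to titles and links before being shown to the user.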
In the future, we want to design a standardized API so that users can access and work with the data, not just discover it. The search itself can be refined further by including other metadata such as location, date and authors, and by extracting features from the data itself. We could then use neural networks and deep learning techniques to build a multilevel recommender in which the results are fine-tuned based on further input from the user.
Our solution is developed entirely in Python. To parse and semantically examine text, we have made extensive use of the Natural Language Toolkit (NLTK), a suite of Natural Language Processing (NLP) libraries for Python. We used Google Cloud Platform (thanks, Space Apps!) to host the solution.
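For example, the kind of NLTK preprocessing applied to both the dataset descriptions and the user's query looks roughly like this; the exact normalization steps in the deployed recommender may differ:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    """Lowercase, tokenize, drop stop words and punctuation, and stem the rest."""
    tokens = word_tokenize(text.lower())
    return [STEMMER.stem(t) for t in tokens if t.isalpha() and t not in STOPWORDS]

print(preprocess("I want data about Mars missions and climate"))
# -> ['want', 'data', 'mar', 'mission', 'climat']
# (stemmed tokens; exact output may vary slightly with NLTK version)
```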
The hardest part of the entire project was downloading the data and getting it ready for analysis. This points to a bigger problem, the need to normalize access to and download of the datasets, which we plan to tackle in a future iteration. The best part of the project was learning about NLP and advanced text-analysis techniques, and about hosting the solution on Google Cloud Platform.
We used all publicly available datasets from the CSA (https://www.asc-csa.gc.ca/eng/open-data/access-the-data.asp) to create our recommender. This step is data-source agnostic: we could just as easily use NASA or JAXA data. As long as a dataset has an English title and description, it can be plugged in and the machine learning model recomputed to include it.
We primarily used the datasets' metadata to compute the NLP model that measures the similarity between the user's query and the existing datasets.
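As a sketch of what that looks like, the snippet below flattens a list of metadata records into the plain-text documents the model is computed from. The field names and example records are illustrative stand-ins, not the actual CSA schema:

```python
def metadata_to_documents(records):
    """Combine each dataset's descriptive metadata fields into one text blob.

    `records` is a list of dicts with illustrative keys such as 'title',
    'description' and 'keywords'; the real CSA metadata uses its own fields.
    """
    documents = []
    for record in records:
        parts = [
            record.get("title", ""),
            record.get("description", ""),
            " ".join(record.get("keywords", [])),
        ]
        documents.append(" ".join(part for part in parts if part))
    return documents

# Two toy records standing in for harvested metadata.
records = [
    {"title": "Radar imagery", "description": "Synthetic aperture radar scenes of Earth.",
     "keywords": ["radar", "earth observation"]},
    {"title": "Laser altimeter data", "description": "Topographic measurements of an asteroid."},
]
documents = metadata_to_documents(records)
```

The resulting documents can then be fed to a model-building step like `build_model` in the earlier sketch, and the whole corpus recomputed whenever a new source is added.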
Data from CSA: https://www.asc-csa.gc.ca/eng/open-data/access-the-data.asp
Google Cloud Platform: https://cloud.google.com/
NLTK: https://www.nltk.org/