Over the past few decades, the availability of cheap computers, nanosatellites and innovative telescopes has produced a treasure trove of data in astronomy and the earth sciences. Yet this vast amount of data is often not well structured or standardized, which makes relevant datasets hard to discover and explore, especially for amateur scientists and enthusiasts, and creates an entry barrier for anyone interested in scientific data. Our solution addresses this problem by making it easier to discover and explore relevant data from a single access point.
We have developed a recommender system that takes a conversational approach to discovering data in a huge data space. Instead of relying on a tag-based or keyword-based search, the user can ask a non-technical natural-language question such as “I want data about Mars missions and climate” or “How did the universe begin? What are the earliest stars?”. The Vhere recommender parses these queries and matches them against pre-processed dataset descriptions and metadata from various sources, which are often highly technical and hard to read. This approach helps experienced users find exact datasets, and it is just as useful when the user only has a vague idea of the data they need and wants to explore the solution space.
With this solution we hope to make scientific data more accessible to non-scientists, amateurs, students and citizen scientists.
We are a team of professional computer scientists with a strong interest in astronomy and the earth sciences. While working on another project involving meteorite-landing data accessed through the NASA portal, we faced the challenge described above ourselves, which inspired us to apply our expertise to it.
The Vhere recommender pre-processes all available dataset descriptions and metadata with the help of Natural Language Processing (NLP) and builds a machine learning model that is stored on disk. User queries are then matched against this pre-computed model to find the best matches, and the user is presented with the most relevant datasets.
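As a rough illustration of this pipeline, the sketch below builds a plain TF-IDF bag-of-words index over the dataset descriptions, pickles it to disk, and ranks datasets against a query by cosine similarity. The function names and the simple whitespace tokenizer are illustrative stand-ins rather than the exact Vhere implementation; the NLTK-based preprocessing we actually rely on is sketched alongside the tools description further down.

```python
import math
import pickle
from collections import Counter

def tokenize(text):
    # Illustrative stand-in; see the NLTK preprocessing sketch further down.
    return text.lower().split()

def build_model(documents, model_path="vhere_model.pkl"):
    """Turn dataset descriptions into TF-IDF vectors and store them on disk."""
    tokenized = [tokenize(doc) for doc in documents]
    df = Counter()                      # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    n_docs = len(tokenized)
    idf = {term: math.log(n_docs / freq) for term, freq in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * idf[term] for term, count in tf.items()})
    with open(model_path, "wb") as f:
        pickle.dump({"idf": idf, "vectors": vectors}, f)

def cosine(a, b):
    # Cosine similarity between two sparse term-weight dictionaries.
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    denom = (math.sqrt(sum(v * v for v in a.values()))
             * math.sqrt(sum(v * v for v in b.values())))
    return num / denom if denom else 0.0

def recommend(query, model_path="vhere_model.pkl", top_k=5):
    """Match a natural-language query against the pre-computed dataset vectors."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    tf = Counter(tokenize(query))
    query_vec = {t: c * model["idf"].get(t, 0.0) for t, c in tf.items()}
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(model["vectors"])]
    return sorted(scores, reverse=True)[:top_k]
```

Calling `recommend("I want data about Mars missions and climate")` would return the indices of the best-scoring dataset descriptions, which are then mapped back to titles and links before being shown to the user.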
In the future, we want to design a standardized API so that users can access and work with the data, not just discover it. The search itself can be refined further by including other metadata such as location, date and authors, and by extracting features from the data itself. We could then use neural networks and deep learning techniques to build a multilevel recommender in which the results are fine-tuned based on further input from the user.
Our solution is developed entirely in Python. To parse and semantically examine text, we have made extensive use of the Natural Language Toolkit (NLTK), a suite of Natural Language Processing (NLP) libraries for Python. We used Google Cloud Platform (thanks, Space Apps!) to host the solution.
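For example, the kind of NLTK preprocessing applied to both the dataset descriptions and the user's query looks roughly like this; the exact normalization steps in the deployed recommender may differ:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    """Lowercase, tokenize, drop stop words and punctuation, and stem the rest."""
    tokens = word_tokenize(text.lower())
    return [STEMMER.stem(t) for t in tokens if t.isalpha() and t not in STOPWORDS]

print(preprocess("I want data about Mars missions and climate"))
# -> ['want', 'data', 'mar', 'mission', 'climat']
# (stemmed tokens; exact output may vary slightly with NLTK version)
```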
The hardest part of the entire project was downloading the data and getting it ready for analysis. This points to a bigger problem, the need to normalize access to and download of the datasets, which we plan to tackle in a future iteration. The best part of the project was learning about NLP and advanced text-analysis techniques, and about hosting the solution on Google Cloud Platform.
We used all publicly available datasets from the CSA (https://www.asc-csa.gc.ca/eng/open-data/access-the-data.asp) to create our recommender. This step is data-source agnostic: we could just as easily use NASA or JAXA data. As long as a dataset has an English title and description, it can be plugged in and the machine learning model recomputed to include it.
We primarily used the datasets' metadata to compute the NLP model that measures the similarity between the user's query and the existing datasets.
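As a sketch of what that looks like, the snippet below flattens a list of metadata records into the plain-text documents the model is computed from. The field names and example records are illustrative stand-ins, not the actual CSA schema:

```python
def metadata_to_documents(records):
    """Combine each dataset's descriptive metadata fields into one text blob.

    `records` is a list of dicts with illustrative keys such as 'title',
    'description' and 'keywords'; the real CSA metadata uses its own fields.
    """
    documents = []
    for record in records:
        parts = [
            record.get("title", ""),
            record.get("description", ""),
            " ".join(record.get("keywords", [])),
        ]
        documents.append(" ".join(part for part in parts if part))
    return documents

# Two toy records standing in for harvested metadata.
records = [
    {"title": "Radar imagery", "description": "Synthetic aperture radar scenes of Earth.",
     "keywords": ["radar", "earth observation"]},
    {"title": "Laser altimeter data", "description": "Topographic measurements of an asteroid."},
]
documents = metadata_to_documents(records)
```

The resulting documents can then be fed to a model-building step like `build_model` in the earlier sketch, and the whole corpus recomputed whenever a new source is added.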
Data from CSA: https://www.asc-csa.gc.ca/eng/open-data/access-the-data.asp
Google Cloud Platform: https://cloud.google.com/
NLTK: https://www.nltk.org/