The Earth Observatory website contains many articles about natural phenomena, and its materials are used by professionals and amateurs alike. Unfortunately, researchers at every level often struggle to find the data behind the work they have just read. A reader who becomes interested in pursuing similar research after reading an article has to spend considerable time and effort locating that data. Although there are many data repositories on the Internet, it is not always easy for users to learn how to extract the information they need from them.
When it comes to satellite data, one of the most popular sites for finding information is search.earthdata.nasa.gov. It houses thousands of collections (databases) of documents in many formats and types, from Excel documents with weather measurements to satellite photographs. The number of files across these collections is estimated in the millions, so even an experienced user can find it difficult to locate the desired information.
That is why we have developed a system that performs the search for the user.
When you open a page with an article, the algorithm extracts the relevant information from it and converts it into a search query for the database site. It then returns the 10 most relevant collections and shows them to the user in the page header, under the article title.
Why is our work important? Any research starts with collecting data, and getting through this stage quickly can motivate people and save them time. Many people abandon a new venture very quickly when faced with difficulties. Our program lets the researcher select data quickly and accurately and start analyzing it immediately, without being distracted by the long, hard work of gathering information.
Our team is very interested in data analysis and information retrieval, so we chose a task that gave us an opportunity to try our hand at both areas.
The Chrome extension is written in JavaScript, and the data analysis is carried out on the server. We have deployed the application to Heroku.
To get data from the Earth Observatory site, we used web scraping and wrote a parser in Python.
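As an illustration of the scraping step, here is a minimal sketch that uses only Python's standard library. The names `ArticleParser` and `parse_article` are hypothetical, and the assumption that the title sits in an `<h1>` tag and the body text in `<p>` tags is ours; the actual parser and page layout may differ.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the first <h1> as the title and every non-empty <p> as body text.

    Hypothetical helper: the tag choices are assumptions about the page layout.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag currently being captured ("h1" or "p")
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only when we are not already inside an <h1> or <p>.
        if self._current is None and tag in ("h1", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return  # closing tag of a nested element such as <em>
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "h1" and not self.title:
            self.title = text
        elif tag == "p" and text:
            self.paragraphs.append(text)
        self._current = None


def parse_article(html):
    """Return the article title and body text found in an HTML string."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "text": "\n".join(parser.paragraphs)}
```

The returned dictionary can then be handed to the text-analysis step; downloading the page itself is left out of the sketch.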
To get data from the text of the articles, we used the spaCy library and its pretrained models for named-entity recognition. From the text, we extracted the names of platforms and instruments (satellites and the devices that collect data), locations and research times, as well as the main keywords of the article.
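The sketch below shows one way the extracted entities could be grouped into search fields. It rests on several assumptions of ours: the mapping from spaCy's pretrained labels (`GPE`, `LOC`, `DATE`) to fields, the small platform and instrument vocabularies (pretrained models have no dedicated labels for these, so a plain lookup is used instead), the model name `en_core_web_sm`, and the function names.

```python
# Assumed mapping from spaCy's pretrained entity labels to the fields we need.
LABEL_TO_FIELD = {"GPE": "locations", "LOC": "locations", "DATE": "dates"}

# Hypothetical vocabularies; a real list would come from a platform/instrument catalog.
KNOWN_PLATFORMS = {"Terra", "Aqua", "Landsat 8"}
KNOWN_INSTRUMENTS = {"MODIS", "VIIRS", "OLI"}


def group_entities(doc):
    """Group the entities of a spaCy Doc (any object with .text and .ents) by field."""
    fields = {"locations": [], "dates": [], "platforms": [], "instruments": []}
    for ent in doc.ents:
        field = LABEL_TO_FIELD.get(ent.label_)
        if field and ent.text not in fields[field]:
            fields[field].append(ent.text)  # keep first occurrence, skip repeats
    # Platform and instrument names are matched by plain substring lookup.
    fields["platforms"] = sorted(p for p in KNOWN_PLATFORMS if p in doc.text)
    fields["instruments"] = sorted(i for i in KNOWN_INSTRUMENTS if i in doc.text)
    return fields


def extract_from_text(text, model="en_core_web_sm"):
    """Run a pretrained spaCy pipeline on the text and group its entities."""
    import spacy  # imported lazily; the model must be downloaded beforehand

    nlp = spacy.load(model)
    return group_entities(nlp(text))
```

Calling `extract_from_text(article_text)` requires spaCy and the chosen model to be installed; `group_entities` on its own only relies on the `text`, `ents`, `label_` attributes.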
After the article has been parsed, we search for relevant datasets on search.earthdata.nasa.gov. All of the site's collections are cataloged in the Common Metadata Repository (CMR).
These metadata records are registered, modified, discovered, and accessed through programmatic interfaces leveraging standard protocols and APIs.
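To make the programmatic interface concrete, here is a minimal sketch that turns extracted terms into a CMR collection search URL. The function name `build_cmr_query` and the choice of parameters are our assumptions rather than the project's actual code; the sketch only builds the URL and does not send the request.

```python
from urllib.parse import urlencode

# Public CMR endpoint for collection searches with JSON output.
CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_cmr_query(keywords, platforms=(), instruments=(), temporal=None, page_size=10):
    """Build a CMR collection search URL from extracted article terms.

    temporal is an optional (start, end) pair of ISO 8601 timestamps.
    """
    params = [("page_size", str(page_size))]
    if keywords:
        params.append(("keyword", " ".join(keywords)))
    params += [("platform[]", name) for name in platforms]
    params += [("instrument[]", name) for name in instruments]
    if temporal:
        start, end = temporal
        params.append(("temporal[]", f"{start},{end}"))
    return f"{CMR_COLLECTIONS_URL}?{urlencode(params)}"
```

For example, `build_cmr_query(["wildfire", "smoke"], platforms=["Terra"], instruments=["MODIS"])` produces a URL that asks for at most 10 matching collections.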
All pages on the Earth Observatory site are divided into three types:
Each type has its own set of data: for example, images have an exact geolocation, and articles show which sources the author referred to when writing them. As explained in the previous part, we scraped this data from the page and then processed it with NLP algorithms.
Then we searched for suitable collections on search.earthdata.nasa.gov using the CMR system.
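A CMR collection search in JSON format returns its matches under `feed.entry`. The sketch below, which assumes that layout and assumes the entries arrive already ordered by relevance, keeps the first 10 and reduces each to the fields needed for display. The function name `top_collections` is hypothetical.

```python
import json


def top_collections(response_text, limit=10):
    """Return up to `limit` collections as {"id", "title"} dicts.

    Assumes the CMR JSON layout {"feed": {"entry": [...]}}; missing keys
    fall back to empty values instead of raising.
    """
    feed = json.loads(response_text).get("feed", {})
    results = []
    for entry in feed.get("entry", [])[:limit]:
        results.append({"id": entry.get("id", ""), "title": entry.get("title", "")})
    return results
```

The resulting list is what the extension would render as links under the article title.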
Example of the algorithm at work:
Figure 1 shows one of the articles from the Earth Observatory loaded without our extension turned on.
(Figure 1) Earth Observatory article view without extension

In Figure 2, we have loaded the same article again, but with the extension enabled.
As you can see, links to the relevant collections appear under the title of the article.
(Figure 2) Earth Observatory article view with extension

We also provide a Google Drive link to a demonstration of the extension at work: