Application of Information and Web Search Engine Techniques on Polar Dataset

The project consisted of three major tasks:

  • Crawling the scientific data - Apache Nutch and Tika were used to crawl scientific data from NASA's sites. Near- and exact-duplicate detection algorithms were then implemented and applied to this data, and the results were recorded and presented.
  • Indexing the crawled data - The data crawled in the first task was then indexed using Apache Solr. Text-based and link-based relevancy ranking was then implemented on the indexed data to produce ranked page results for search.
  • Building inferences on the generated knowledge base - All the collected and indexed data was then used to draw inferences based on locations, timeframes, and keywords, using D3, Banana, and facetview. (Please refer to the video.)
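
As an illustration of the duplicate-detection step in the first task, here is a minimal Python sketch. It is not the project's actual implementation: it assumes exact duplicates are found by hashing normalized text and near duplicates by Jaccard similarity over word shingles, which is one common approach. The function names, the shingle size of 3, and the 0.5 threshold are all hypothetical choices made for this example.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def exact_fingerprint(text):
    """SHA-256 digest of the normalized text; equal digests are treated as exact duplicates."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of the text."""
    words = normalize(text).split()
    if len(words) < k:
        # Too short for a full window: use the whole text as a single shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(doc_a, doc_b, threshold=0.5):
    """Label a pair of documents as 'exact', 'near', or 'distinct' (threshold is an assumption)."""
    if exact_fingerprint(doc_a) == exact_fingerprint(doc_b):
        return "exact"
    similarity = jaccard(shingles(doc_a), shingles(doc_b))
    return "near" if similarity >= threshold else "distinct"
```

In a crawl of realistic size, comparing every pair of pages this way would be too slow; techniques such as MinHash or SimHash are typically used to approximate the same similarity at scale.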

Dataset - Scientific data crawled and indexed from NASA's website
Technologies/Platforms - Python, Java, HTML, CSS, D3.js, JavaScript
Tools - Apache Tika, Apache Nutch, Apache Solr
Duration - Jan 2015 - May 2015
Team Members - 5
Video -