Institutional Web Scraping and Analysis

Welcome to Our Data Analysis Website

This website presents the project we made for our data science course. We scraped each faculty member's webpage from the official institute websites of IISER Pune, IISER Mohali and IISER Kolkata, then analysed the research fields and educational backgrounds of the faculty.

About Us

The team:
1. Kaushik Gupta MS20129
2. Rohit Khandhare MS20234
3. Samannay Bhuyan MS20081

Web Scraping

The Python package Beautiful Soup is used to scrape each faculty member's webpage (around 500 webpages). First, an sqlite3 database is created and a connection to it is opened. Then, a connection is established with the website containing the faculty details. We loop through the individual faculty webpages and scrape the data they contain. The scraped text is processed according to the structure of each webpage to extract the name, research area and educational background of each faculty member. This data is saved in tabular form in the sqlite3 database.
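
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample HTML, the tag names and ids (`name`, `research`, `education`), the `faculty` table schema and the helper names `parse_faculty_page` and `save_faculty` are all assumptions, and an in-memory database stands in for the real database file. In the real pipeline each page would first be downloaded from the institute website.

```python
import sqlite3
from bs4 import BeautifulSoup

# Hypothetical faculty page. The real IISER pages have their own layout,
# so the selectors below would need to be adapted per institute.
SAMPLE_PAGE = """
<html><body>
  <h1 class="name">Dr. A. Example</h1>
  <div id="research"><p>Algebraic geometry, number theory</p></div>
  <div id="education"><p>PhD, Example University, 2010</p></div>
</body></html>
"""

def parse_faculty_page(html):
    """Extract (name, research area, education) from one page (assumed layout)."""
    soup = BeautifulSoup(html, "html.parser")
    name = soup.find("h1", class_="name").get_text(strip=True)
    research = soup.find("div", id="research").get_text(" ", strip=True)
    education = soup.find("div", id="education").get_text(" ", strip=True)
    return name, research, education

def save_faculty(conn, record):
    """Insert one (name, research_area, education) row."""
    conn.execute(
        "INSERT INTO faculty (name, research_area, education) VALUES (?, ?, ?)",
        record,
    )

# Create the database and table (in memory here, a file in practice).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS faculty "
    "(name TEXT, research_area TEXT, education TEXT)"
)

# Loop over the pages; here a single hard-coded page stands in for ~500.
for html in [SAMPLE_PAGE]:
    save_faculty(conn, parse_faculty_page(html))
conn.commit()

rows = conn.execute(
    "SELECT name, research_area, education FROM faculty"
).fetchall()
print(rows)
```

Parsing and storage are kept in separate functions so that only `parse_faculty_page` has to change when an institute uses a different page structure.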


The goal of the project is to analyse the research areas and educational qualifications of the faculty. For this, all the text in the research area sections is concatenated. The NLTK-based RAKE algorithm is used to get the frequency of research fields in this text. Next, the spaCy library is used to get the names and frequencies of research institutes. This data is then used to plot the percentage of Indian and foreign institutes. Finally, the contribution of various types of institutes, such as IITs, central research institutes and state public universities, is analysed.
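
As an illustration of the last step, the share of Indian and foreign institutes could be computed along these lines. This is only a sketch under stated assumptions: the list of institute mentions is made-up sample data, not our results, and the `INDIAN_INSTITUTES` lookup set is a hypothetical stand-in for however the institutes are actually classified.

```python
from collections import Counter

# Made-up institute mentions, as an extraction step might return them.
# These are NOT the project's data.
institute_mentions = [
    "IIT Bombay",
    "University of Cambridge",
    "IIT Bombay",
    "Indian Institute of Science",
    "Harvard University",
]

# Hypothetical lookup set used to label an institute as Indian.
INDIAN_INSTITUTES = {"IIT Bombay", "Indian Institute of Science"}

# Frequency of each institute name.
counts = Counter(institute_mentions)

total = sum(counts.values())
indian = sum(n for name, n in counts.items() if name in INDIAN_INSTITUTES)

indian_pct = 100 * indian / total
foreign_pct = 100 - indian_pct

print(counts.most_common())
print(f"Indian: {indian_pct:.1f}%  Foreign: {foreign_pct:.1f}%")
```

The same counting approach extends to institute types (IITs, central research institutes, state public universities) by replacing the single lookup set with one set per category.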

Data Analysis and Case Studies:

We present our data analysis methodologies and case studies:

RAKE

Rapid Automatic Keyword Extraction (RAKE) is a well-known keyword extraction method which uses a list of stopwords and phrase delimiters to detect the most relevant words or phrases in a piece of text.
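
To make the idea concrete, here is a small from-scratch sketch of RAKE-style scoring. It is not the library implementation we used: the stopword list is a tiny assumed sample, and the function names `candidate_phrases` and `rake_scores` are our own for this illustration. Each word is scored by its degree (total length of the phrases it appears in) divided by its frequency, and a phrase is scored by the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny assumed stopword list; a real run would use a full list.
STOPWORDS = {"and", "of", "the", "in", "on", "for", "to", "a", "with"}

def candidate_phrases(text):
    """Split text into runs of non-stopwords, breaking at punctuation."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                # A stopword ends the current candidate phrase.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Return (phrase, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scores = {
        " ".join(phrase): sum(word_score[w] for w in phrase)
        for phrase in set(phrases)
    }
    # Sort by descending score, then alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

text = ("Soft condensed matter physics and quantum information. "
        "Particle physics, algebraic geometry.")
ranked = rake_scores(text)
for phrase, score in ranked:
    print(f"{score:5.1f}  {phrase}")
```

Longer phrases made of words that rarely appear alone tend to score highest, which is why multi-word research fields rise to the top of the keyword tables.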

spaCy

spaCy is a popular open-source natural language processing (NLP) library and framework. It is used for various NLP tasks, such as tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and more. In this project, it was used to extract the names of universities.
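
A rough sketch of how institute names can be pulled out of educational background text with spaCy is shown below. It makes several assumptions: it uses a blank English pipeline with a rule-based `entity_ruler` so that it runs without downloading a trained model, whereas a pretrained named entity recognition model may behave differently. The two patterns and the sample strings are hypothetical and only cover names shaped like "University of X" or "Indian Institute of X".

```python
from collections import Counter

import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp = spacy.blank("en")

# Rule-based entity matcher with two illustrative (assumed) patterns.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": [
        {"LOWER": "university"}, {"LOWER": "of"}, {"IS_TITLE": True}]},
    {"label": "ORG", "pattern": [
        {"LOWER": "indian"}, {"LOWER": "institute"},
        {"LOWER": "of"}, {"IS_TITLE": True}]},
])

# Made-up educational background strings, not scraped data.
backgrounds = [
    "PhD, University of Cambridge, 2010",
    "MSc, Indian Institute of Science; PhD, University of Cambridge",
]

# Count every ORG entity found across all background strings.
institute_counts = Counter(
    ent.text
    for doc in nlp.pipe(backgrounds)
    for ent in doc.ents
    if ent.label_ == "ORG"
)

print(institute_counts.most_common())
```

The resulting counter gives both the name and the frequency of each institute, which is the input needed for the Indian versus foreign comparison.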

1. IISER Mohali case study:

We scraped the "research area" and "research focus" sections from each IISER Mohali faculty webpage and analysed them using NLP algorithms such as RAKE and KeyBERT.

Research

Keywords Table

Index  Description                    Score
1      Genome regulation              102
2      Functional Analysis            102
3      Developmental genetics         102
4      Environmental science          82
5      Quantum information            82
6      Soft condensed matter physics  57
7      Archaeology                    57
8      Algebraic geometry             51
9      Particle physics               51
10     Molecular cell biology         43

The score presented here reflects the correlation between different strings, which will be discussed further in the presentation.

2. IISER Pune case study:

We scraped the "Name", "Academic Background" and "Research Area" sections from each IISER Pune faculty webpage. The data was again analysed using NLP tools such as RAKE and spaCy.


Keywords Table

Index  Description                      Score
1      Analytic Number Theory           67
2      Plasmodium Epigenetics           67
3      Chromosome Biology               67
4      Experimental Particle Physics    64
5      Theoretical Biophysics           49
6      Chemical Biology                 6
7      Inorganic Chemistry              49
8      Condensed Matter Physics         38
9      Computational Biophysics         31
10     Materials Science                26
11     Algebraic Geometry               21
12     Modern Indian Political Thought  20
13     Number Theory                    18
14     Cosmology                        18


We present the analysis of the academic background of the IISER Pune faculty.