Welcome to Our Data Analysis Website

This website presents the project we made for our data science course. We scraped each faculty member's webpage from the official institute websites of IISER Pune, IISER Mohali and IISER Kolkata, and then analysed the data on the research fields and educational backgrounds of the faculty.

About Us

The team:
1. Kaushik Gupta MS20129
2. Rohit Khandhare MS20234
3. Samannay Bhuyan MS20081

Web Scraping

The Python package Beautiful Soup is used to scrape each faculty webpage (around 500 webpages). First, an sqlite3 database is created and a connection to it is established. Then, a connection is established with the website containing the faculty details. We loop through the individual faculty webpages and scrape the data they contain. The scraped text is processed according to the structure of each webpage to extract the name, research area and educational background of each faculty member. This data is saved in tabular form in the sqlite3 database.
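The parse-and-store steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample HTML, the CSS selectors, the table name `faculty` and the helper names are all hypothetical, since the real faculty pages each have their own structure. Fetching the pages over the network is left out so the sketch stays self-contained.

```python
import sqlite3
from bs4 import BeautifulSoup

# Hypothetical page markup (assumption); real faculty pages are structured differently.
SAMPLE_PAGE = """
<html><body>
  <h1 class="name">Dr. Example Faculty</h1>
  <div id="research"><p>Algebraic geometry and number theory</p></div>
  <div id="education"><p>PhD, Example University</p></div>
</body></html>
"""

def parse_faculty_page(html):
    """Extract name, research area and education from one page's HTML."""
    soup = BeautifulSoup(html, "html.parser")

    def text_of(selector):
        # Return the visible text of the first match, or "" if nothing matches.
        node = soup.select_one(selector)
        return node.get_text(" ", strip=True) if node else ""

    return {
        "name": text_of("h1.name"),           # selector is an assumption
        "research_area": text_of("#research"),
        "education": text_of("#education"),
    }

def save_faculty(conn, record):
    """Create the (hypothetical) faculty table if needed and insert one row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS faculty "
        "(name TEXT, research_area TEXT, education TEXT)"
    )
    conn.execute(
        "INSERT INTO faculty VALUES (:name, :research_area, :education)",
        record,
    )
    conn.commit()

# An in-memory database keeps the example side-effect free.
conn = sqlite3.connect(":memory:")
record = parse_faculty_page(SAMPLE_PAGE)
save_faculty(conn, record)
rows = conn.execute(
    "SELECT name, research_area, education FROM faculty"
).fetchall()
```

In a real run, the loop over faculty pages would call `parse_faculty_page` once per downloaded page, with selectors adapted to each institute's layout.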

The goal of the project is to analyse the research areas and educational qualifications of the faculty. For this, all the text in the research area sections is concatenated. The RAKE algorithm (in its NLTK-based implementation) is used to get the frequency of research fields in this text. Next, the library spaCy is used to get the names and frequencies of research institutes. This data is then used to plot the percentage of Indian and foreign institutes. Finally, the contribution of various types of institutes, such as IITs, central research institutes and state public universities, is analysed.
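To give a feel for how RAKE ranks phrases, here is a simplified, dependency-free sketch of RAKE-style scoring. It is not the NLTK-based implementation used in the project: the stopword list is a tiny illustrative assumption, and the function names are hypothetical. Candidate phrases are obtained by splitting the text at punctuation and stopwords; each word is scored as degree divided by frequency, and a phrase's score is the sum of its word scores.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption); real runs would use a fuller one.
STOPWORDS = {"and", "of", "the", "in", "on", "for", "to", "with", "a", "an"}

def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[.,;:()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z]+", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Return (phrase, score) pairs sorted by descending score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs
    degree = defaultdict(int)  # co-occurrence count, including the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    scores = {
        " ".join(phrase): sum(degree[w] / freq[w] for w in phrase)
        for phrase in set(phrases)
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

text = "Soft condensed matter physics, particle physics and quantum information"
ranked = rake_scores(text)
```

Because word scores grow with phrase length, longer phrases tend to rank higher, which is one reason multi-word research areas dominate keyword tables produced this way.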

Data Analysis and Case Studies

We present our data analysis methodologies and case studies:

1. IISER Mohali case study:

We scraped the "research area" and "research focus" sections from each IISER Mohali faculty webpage and analysed them using NLP algorithms such as RAKE and KeyBERT.

Keywords Table

Index	Description	Scores
1	Genome regulation	102
2	Functional Analysis	102
3	Developmental genetics	102
4	Environmental science	82
5	Quantum information	82
6	Soft condensed matter physics	57
7	Archaeology	57
8	Algebraic geometry	51
9	Particle physics	51
10	Molecular cell biology	43

The score presented here reflects the correlation between different strings; it will be discussed in more detail in the presentation.

2. IISER Pune case study:

We scraped the "Name", "Academic Background" and "Research Area" sections from each IISER Pune faculty webpage. The data was again analysed using NLP tools such as RAKE and spaCy.

Keywords Table

Index	Description	Scores
1	Analytic Number Theory	67
2	Plasmodium Epigenetics	67
3	Chromosome Biology	67
4	Experimental Particle Physics	64
5	Theoretical Biophysics	49
6	Chemical Biology	6
7	Inorganic Chemistry	49
8	Condensed Matter Physics	38
9	Computational Biophysics	31
10	Materials Science	26
11	Algebraic Geometry	21
12	Modern Indian Political Thought	20
13	Number Theory	18
14	Cosmology	18

Institutional Web Scraping and Analysis

We present the analysis of the academic background of the IISER Pune faculty.
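As a rough illustration of how the share of Indian and foreign institutes could be computed once institute names have been extracted, consider the sketch below. The marker list, the function names and the sample institute names are all hypothetical; the project's actual matching rules are not given here, and simple substring matching like this can misclassify names.

```python
from collections import Counter

# Hypothetical substrings taken to indicate an Indian institute (assumption).
INDIAN_MARKERS = ("iit", "iiser", "indian institute", "tifr", "university of delhi")

def is_indian(institute):
    """Crude check: does the lowercased name contain any marker?"""
    name = institute.lower()
    return any(marker in name for marker in INDIAN_MARKERS)

def origin_percentages(institutes):
    """Return the percentage of Indian and foreign institutes in the list."""
    counts = Counter("Indian" if is_indian(name) else "Foreign" for name in institutes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * count / total, 1) for label, count in counts.items()}

# Made-up sample input, not data from the project.
sample = ["IIT Bombay", "Indian Institute of Science", "University of Cambridge", "MIT"]
shares = origin_percentages(sample)
```

The same counting pattern extends to finer categories (for example IITs, central research institutes and state public universities) by returning a category label instead of a two-way split.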