Welcome to Our Data Analysis Website
This website presents the project we made for our data science course. We scraped each faculty member's webpage from the official institute websites of IISER Pune, IISER Mohali and IISER Kolkata, and then analysed the data on the research fields and educational backgrounds of the faculty.
About Us
The team:
1. Kaushik Gupta MS20129
2. Rohit Khandhare MS20234
3. Samannay Bhuyan MS20081
Web Scraping
The Python package Beautiful Soup is used to scrape each faculty webpage (around 500 webpages). First, an sqlite3 database is created and a connection to it is established. Next, a connection is made to the website containing the faculty details. We loop through the individual faculty webpages and scrape the data they contain. The scraped text is processed according to the structure of each webpage to extract the name, research area and educational background of the faculty member. This data is saved in tabular form in the sqlite3 database.
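The scrape-and-store loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample HTML, the tag names and ids (`h1.name`, `#research`, `#education`), the table name `faculty` and the helper functions are all assumptions made for the example, and an inline HTML string stands in for a downloaded page so that the sketch runs without network access.

```python
import sqlite3
from bs4 import BeautifulSoup

# Hypothetical stand-in for one downloaded faculty page.
# Real institute pages have their own (different) structure.
SAMPLE_PAGE = """
<html><body>
  <h1 class="name">Dr. A. Example</h1>
  <div id="research"><p>Quantum information</p></div>
  <div id="education"><p>PhD, Example University</p></div>
</body></html>
"""

def parse_faculty_page(html):
    """Extract name, research area and education from one page (assumed layout)."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": soup.find("h1", class_="name").get_text(strip=True),
        "research": soup.find("div", id="research").get_text(" ", strip=True),
        "education": soup.find("div", id="education").get_text(" ", strip=True),
    }

def save_faculty(conn, record):
    """Insert one parsed record into the (hypothetical) faculty table."""
    conn.execute(
        "INSERT INTO faculty (name, research, education) VALUES (?, ?, ?)",
        (record["name"], record["research"], record["education"]),
    )

# An in-memory database keeps the sketch self-contained; a file path works the same way.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS faculty (name TEXT, research TEXT, education TEXT)"
)

# In practice this list would hold one HTML string per faculty URL.
for html in [SAMPLE_PAGE]:
    save_faculty(conn, parse_faculty_page(html))
conn.commit()

rows = conn.execute("SELECT name, research, education FROM faculty").fetchall()
print(rows)
```

Keeping the parsing step separate from the storage step means that a per-institute parser can be swapped in without touching the database code.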
The goal of the project is to analyse the research areas and educational qualifications of the faculty. For this, all the text in the research area sections is concatenated. The RAKE algorithm (NLTK Rake) is used to get the frequency of research fields in this text. Next, the spaCy library is used to get the names and frequencies of research institutes. This data is analysed to plot the percentage of Indian and foreign institutes. Finally, the contribution of various types of institutes, such as IITs, Central Research Institutes and State Public Universities, is analysed.
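The percentage step can be illustrated with a short sketch. The institute names, counts and the lookup table below are invented placeholders, and `percentage_by` is a hypothetical helper; the project's own classification of institutes may well have been done differently.

```python
from collections import Counter

# Hypothetical institute frequencies, e.g. as produced by an entity-extraction step.
institute_counts = Counter({
    "IIT Example A": 3,
    "Example State University": 1,
    "Example Foreign University": 4,
})

# Hypothetical hand-made lookup: institute -> country group and institute type.
INSTITUTE_INFO = {
    "IIT Example A": {"group": "Indian", "type": "IIT"},
    "Example State University": {"group": "Indian", "type": "State Public University"},
    "Example Foreign University": {"group": "Foreign", "type": "Foreign"},
}

def percentage_by(counts, info, field):
    """Aggregate counts by the given field and convert them to percentages."""
    totals = Counter()
    for institute, n in counts.items():
        totals[info[institute][field]] += n
    grand_total = sum(totals.values())
    return {key: round(100 * value / grand_total, 1) for key, value in totals.items()}

by_group = percentage_by(institute_counts, INSTITUTE_INFO, "group")
by_type = percentage_by(institute_counts, INSTITUTE_INFO, "type")
print(by_group)
print(by_type)
```

The resulting dictionaries can be passed directly to a plotting library to draw the pie or bar charts.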
Data Analysis and Case studies:
We present our data analysis methodologies and case studies:
RAKE
Rapid Automatic Keyword Extraction (RAKE) is a well-known keyword extraction method which uses a list of stopwords and phrase delimiters to detect the most relevant words or phrases in a piece of text.
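To make the idea concrete, here is a small pure-Python sketch of the RAKE scoring scheme (word score = degree / frequency, phrase score = sum of its word scores). It is a simplified illustration under stated assumptions, not the library implementation used in the project: the stopword list is a tiny hand-picked set, the function names are hypothetical, and the scores it produces are not comparable to those in the tables on this page.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real runs use a much larger one.
STOPWORDS = {"and", "of", "the", "in", "to", "a", "for", "on", "with"}

def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation acts as a phrase delimiter.
    for chunk in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z]+", chunk):
            if word in STOPWORDS:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Return (phrase, score) pairs sorted from highest to lowest score."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences within the phrase, including the word itself.
            degree[word] += len(phrase)
    word_score = {word: degree[word] / freq[word] for word in freq}
    scores = {
        " ".join(phrase): sum(word_score[word] for word in phrase)
        for phrase in set(phrases)
    }
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

ranked = rake_scores(
    "Quantum information and condensed matter physics. Theory of quantum information."
)
print(ranked)
```

Longer phrases made of words that rarely appear alone tend to score highest, which is why multi-word research fields rise to the top.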
spaCy
spaCy is a popular open-source natural language processing (NLP) library. It is used for various NLP tasks, such as tokenization, part-of-speech tagging, named entity recognition and dependency parsing. In this project, it was used to extract the names of the universities.
1. IISER Mohali case study:
We scraped the "research area" and "research focus" sections from each IISER Mohali faculty webpage and analysed them using NLP algorithms such as RAKE and KeyBERT.
Keywords Table
Index | Keyword | Score |
---|---|---|
1 | Genome regulation | 102 |
2 | Functional Analysis | 102 |
3 | Developmental genetics | 102 |
4 | Environmental science | 82 |
5 | Quantum information | 82 |
6 | Soft condensed matter physics | 57 |
7 | Archaeology | 57 |
8 | Algebraic geometry | 51 |
9 | Particle physics | 51 |
10 | Molecular cell biology | 43 |
The score shown here reflects the correlation between different strings; it is discussed in more detail in the presentation.
2. IISER Pune case study:
We scraped the "Name", "Academic Background" and "Research Area" sections from each IISER Pune faculty webpage. As before, the data was analysed using NLP tools such as RAKE and spaCy.
Keywords Table
Index | Keyword | Score |
---|---|---|
1 | Analytic Number Theory | 67 |
2 | Plasmodium Epigenetics | 67 |
3 | Chromosome Biology | 67 |
4 | Experimental Particle Physics | 64 |
5 | Theoretical Biophysics | 49 |
6 | Chemical Biology | 6 |
7 | Inorganic Chemistry | 49 |
8 | Condensed Matter Physics | 38 |
9 | Computational Biophysics | 31 |
10 | Materials Science | 26 |
11 | Algebraic Geometry | 21 |
12 | Modern Indian Political Thought | 20 |
13 | Number Theory | 18 |
14 | Cosmology | 18 |