88 changes: 88 additions & 0 deletions www/services/bibliometrix_etl/README.md
@@ -0,0 +1,88 @@
# Bibliometrix Python ETL Project

## Project Overview

This project implements a Python ETL pipeline for converting heterogeneous bibliographic data into a Web of Science-like schema that can be used by Bibliometrix-Python.

The project covers both requirement levels: the base level reads local files, and the advanced level retrieves records from the OpenAlex API.

## Problem

Bibliographic data from different sources such as Scopus, Dimensions, PubMed, and OpenAlex have different formats and field names. Bibliometrix-Python requires a consistent schema similar to Web of Science field tags.

This project solves the problem by building a source-agnostic ETL pipeline.

## ETL Pipeline

The pipeline follows three main stages:

1. Extract
2. Transform
3. Load

### Extract

The extraction module supports:

- Local CSV/XLSX/TXT files for the base level
- OpenAlex API retrieval for the advanced level
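
As a rough illustration of the advanced-level extraction, the sketch below builds an OpenAlex works search URL and flattens one nested work into a flat raw record. The function names (`build_openalex_url`, `flatten_work`), the default of 50 records per page, and the flat key names are hypothetical and not taken from `extractors.py`. The JSON keys are assumed from the public OpenAlex works schema, and the sample work is hand-made, not real API output. Only the standard library is used here, although the project itself installs `requests`.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_openalex_url(query, per_page=50):
    """Build a search URL for the OpenAlex works endpoint."""
    params = {"search": query, "per-page": per_page}
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


def flatten_work(work):
    """Reduce one nested OpenAlex work dict to a flat raw record."""
    authors = [
        (entry.get("author") or {}).get("display_name") or ""
        for entry in work.get("authorships") or []
    ]
    source = (work.get("primary_location") or {}).get("source") or {}
    return {
        # OpenAlex IDs are URLs; keep only the trailing "W..." part.
        "id": (work.get("id") or "").rsplit("/", 1)[-1],
        "doi": work.get("doi") or "",
        "title": work.get("title") or "",
        "year": work.get("publication_year"),
        "source": source.get("display_name") or "",
        "cited_by": work.get("cited_by_count") or 0,
        "authors": authors,
    }


# Hand-made dict shaped like an OpenAlex work; not real API output.
sample_work = {
    "id": "https://openalex.org/W0000000001",
    "doi": "https://doi.org/10.1000/sample1",
    "title": "Example Title",
    "publication_year": 2024,
    "cited_by_count": 15,
    "primary_location": {"source": {"display_name": "Journal of Data Science"}},
    "authorships": [
        {"author": {"display_name": "Ada Example"}},
        {"author": {"display_name": "Bo Sample"}},
    ],
}

url = build_openalex_url("bibliometric analysis")
record = flatten_work(sample_work)
print(url)
print(record["id"], record["year"], record["authors"])
```

In a real run the URL would be fetched over HTTP and each entry of the response's `results` list passed through `flatten_work`; the network call is left out so the sketch runs offline.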

### Transform

The transformation module converts raw source fields into standard Web of Science-like tags such as:

- TI: Title
- AU: Authors
- PY: Publication Year
- SO: Source Title
- DI: DOI
- AB: Abstract
- TC: Times Cited
- SR: Short Reference
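
A minimal sketch of this mapping step, using one row of `data_raw/scopus_sample.csv`. The column-to-tag table, the DOI clean-up rule, and the `SR` format (first author, year, source title) are assumptions inferred from the sample data and output files; the project's actual rules live in `mappings.py` and `transformer.py` and may differ. The helper names `normalize_doi` and `transform_row` are hypothetical, and plain dictionaries stand in for the pandas DataFrame the project uses.

```python
import re

# Assumed Scopus-to-tag mapping; the real table is in mappings.py.
SCOPUS_TO_WOS = {
    "Authors": "AU",
    "Title": "TI",
    "Year": "PY",
    "Source title": "SO",
    "DOI": "DI",
    "Abstract": "AB",
    "Cited by": "TC",
}


def normalize_doi(raw):
    """Strip a leading 'https://doi.org/' or 'doi:' so only the bare DOI remains."""
    pattern = r"^(https?://(dx\.)?doi\.org/|doi:\s*)"
    return re.sub(pattern, "", raw.strip(), flags=re.IGNORECASE)


def transform_row(row):
    """Rename one raw Scopus row to WoS-like tags and derive the SR tag."""
    out = {tag: (row.get(col) or "").strip() for col, tag in SCOPUS_TO_WOS.items()}
    out["DI"] = normalize_doi(out["DI"])
    # Multi-value field: split on ';' and keep a list until export.
    out["AU"] = [name.strip() for name in out["AU"].split(";") if name.strip()]
    out["PY"] = int(out["PY"]) if out["PY"] else None
    out["TC"] = int(out["TC"]) if out["TC"] else 0
    first_author = out["AU"][0] if out["AU"] else ""
    out["SR"] = f"{first_author}, {out['PY']}, {out['SO']}"
    return out


# Second data row of data_raw/scopus_sample.csv (subset of columns).
raw_row = {
    "Authors": "Chen L.; Karim M.",
    "Title": "OpenAlex Data Standardization for Bibliometrix Python",
    "Year": "2023",
    "Source title": "Scientometrics Review",
    "Cited by": "9",
    "DOI": "doi:10.1000/sample2",
    "Abstract": "This paper discusses data standardization challenges "
                "in Python bibliometric tools.",
}

standardized = transform_row(raw_row)
print(standardized["DI"])
print(standardized["AU"])
print(standardized["SR"])
```

The three DOI spellings in the sample file (`https://doi.org/...`, `doi:...`, and the bare form) all reduce to the same bare form under this rule, which keeps `DI` comparable across sources.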

### Load

The final standardized data is exported as CSV files in the `outputs` folder.
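
The export step might look roughly like the sketch below, which joins multi-value lists with `"; "` (the separator visible in the generated CSV files) and writes rows in a fixed column order. The column subset, the `to_csv_text` helper, and the use of the standard `csv` module instead of pandas are assumptions made for illustration.

```python
import csv
import io

COLUMNS = ["TI", "AU", "PY", "SO", "DI", "TC"]  # illustrative subset of tags
MULTI_VALUE = {"AU"}  # columns held as lists until export


def to_csv_text(records, columns=COLUMNS):
    """Serialize standardized records to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        row = {}
        for col in columns:
            value = rec.get(col, "")
            if col in MULTI_VALUE and isinstance(value, list):
                value = "; ".join(value)
            # Write missing values as empty cells rather than the text "None".
            row[col] = "" if value is None else value
        writer.writerow(row)
    return buffer.getvalue()


records = [
    {
        "TI": "A Source-Agnostic ETL Pipeline for Bibliographic Data",
        "AU": ["Lee K.", "Hasan R."],
        "PY": 2022,
        "SO": "International Journal of Information Systems",
        "DI": "10.1000/sample3",
        "TC": 22,
    }
]

csv_text = to_csv_text(records)
print(csv_text)
```

Writing `csv_text` to a path such as `outputs/standardized_scopus_sample.csv` with UTF-8 encoding would complete the step.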

## Project Structure

```text
Bibliometrix_ETL_Project
├── data_raw
├── outputs
├── report
└── src
    ├── mappings.py
    ├── utils.py
    ├── extractors.py
    ├── transformer.py
    ├── validator.py
    └── main.py
```

## How to Run

Install required packages:

```bash
python -m pip install pandas requests openpyxl
```

Run the project (Windows path shown; on macOS or Linux use `python src/main.py`):

```bash
python .\src\main.py
```

## Generated Outputs

After running the project, the following files are generated:

- `outputs/raw_openalex_records.csv`
- `outputs/standardized_openalex_api.csv`
- `outputs/standardized_scopus_sample.csv`
- `outputs/validation_report_openalex.txt`
- `outputs/validation_report_scopus.txt`

## Validation

The validation module checks:

- Required columns are present
- No null values exist in the final DataFrame
- Multi-value columns are stored correctly before CSV export

The project successfully produces standardized bibliographic data and validation reports.
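
The checks listed under Validation can be sketched as follows. This is not the code in `validator.py`: the required-column list, the reading of "stored correctly" as "held as a Python list", the `validate_records` and `format_report` names, and the `FAILED` message are all assumptions. Only the `PASSED` line is taken from the project's execution log.

```python
REQUIRED = ["TI", "AU", "PY", "SO", "DI", "AB", "TC", "SR"]  # assumed set
MULTI_VALUE = ["AU"]  # assumed to be lists before export


def validate_records(records, required=REQUIRED, multi_value=MULTI_VALUE):
    """Return a list of problem messages; an empty list means the data passed."""
    problems = []
    for index, rec in enumerate(records):
        missing = [col for col in required if col not in rec]
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
        nulls = [col for col, value in rec.items() if value is None]
        if nulls:
            problems.append(f"row {index}: null values in {nulls}")
        not_lists = [
            col for col in multi_value
            if col in rec and not isinstance(rec[col], list)
        ]
        if not_lists:
            problems.append(f"row {index}: expected lists in {not_lists}")
    return problems


def format_report(problems):
    """Render a one-line verdict followed by any problem messages."""
    if not problems:
        return "PASSED: Standardized data is valid."
    # The wording of the failure line is hypothetical.
    return "\n".join(["FAILED: Standardized data is not valid."] + problems)


good = {"TI": "t", "AU": ["Smith J."], "PY": 2024, "SO": "s",
        "DI": "10.1000/sample1", "AB": "", "TC": 15, "SR": "Smith J., 2024, s"}
bad = {"TI": None, "AU": "Smith J.; Rahman A.", "PY": 2024}

print(format_report(validate_records([good])))
print(format_report(validate_records([bad])))
```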
4 changes: 4 additions & 0 deletions www/services/bibliometrix_etl/data_raw/scopus_sample.csv
@@ -0,0 +1,4 @@
Authors,Author full names,Title,Year,Source title,Volume,Issue,Page start,Page end,Cited by,DOI,Abstract,Author Keywords,Index Keywords,Affiliations,References,Document Type,Language of Original Document,EID
Smith J.; Rahman A.,"Smith, John; Rahman, Ahmed",Bibliometric Analysis of Artificial Intelligence Research,2024,Journal of Data Science,12,2,101,115,15,https://doi.org/10.1000/sample1,This study analyzes artificial intelligence research using bibliometric methods.,bibliometrics; artificial intelligence; research trends,data science; scientometrics,Nanjing University of Information Science and Technology; University of Dhaka,Reference A; Reference B; Reference C,Article,English,SCOPUS-ID-001
Chen L.; Karim M.,"Chen, Li; Karim, Mohammad",OpenAlex Data Standardization for Bibliometrix Python,2023,Scientometrics Review,8,1,55,70,9,doi:10.1000/sample2,This paper discusses data standardization challenges in Python bibliometric tools.,OpenAlex; ETL; bibliometrix; Python,metadata; data pipeline,Nanjing University of Information Science and Technology,Reference X; Reference Y,Conference Paper,English,SCOPUS-ID-002
Lee K.; Hasan R.,"Lee, Kim; Hasan, Rakib",A Source-Agnostic ETL Pipeline for Bibliographic Data,2022,International Journal of Information Systems,15,4,201,220,22,10.1000/sample3,The study proposes an ETL pipeline for heterogeneous bibliographic data sources.,ETL; bibliographic data; data transformation,information systems; metadata conversion,University of Malaya; Nanjing University of Information Science and Technology,Reference M; Reference N; Reference O,Article,English,SCOPUS-ID-003
66 changes: 66 additions & 0 deletions www/services/bibliometrix_etl/execution_log.md
@@ -0,0 +1,66 @@
# Execution Log

## Project Name

Bibliometrix Python ETL Project

## Objective

The objective of this project is to build a Python ETL pipeline that converts heterogeneous bibliographic records into a Web of Science-like schema compatible with Bibliometrix-Python.

## Environment

Python packages used:

- pandas
- requests
- openpyxl

## Command Used

```bash
python .\src\main.py
```

## Base Level Execution

The base-level pipeline used a local Scopus-like CSV file.

Input file:

- `data_raw/scopus_sample.csv`

Generated files:

- `outputs/standardized_scopus_sample.csv`
- `outputs/scopus_first_5_normalized_rows.csv`
- `outputs/validation_report_scopus.txt`

Validation result: `PASSED: Standardized data is valid.`

## Advanced Level Execution

The advanced-level pipeline used the OpenAlex API.

API query: `bibliometric analysis`

Generated files:

- `outputs/raw_openalex_records.csv`
- `outputs/standardized_openalex_api.csv`
- `outputs/openalex_first_5_normalized_rows.csv`
- `outputs/validation_report_openalex.txt`

Validation result: `PASSED: Standardized data is valid.`

## Final Terminal Result

The terminal showed: `PROJECT EXECUTION COMPLETED`

## Conclusion

The ETL pipeline successfully extracted records from both a local file and the OpenAlex API, transformed them into a Web of Science-like schema, validated the standardized data, and exported CSV outputs.
118 changes: 118 additions & 0 deletions www/services/bibliometrix_etl/outputs/analysis_validation_report.txt
@@ -0,0 +1,118 @@
Bibliometrix ETL Analysis Validation Report
======================================================================

======================================================================
Analysis Validation for: Base Level - Scopus Local File
======================================================================

1. Total Records
Total records: 3

2. Publications by Year
PY
2022 1
2023 1
2024 1

3. Top Source Titles
SO
Journal of Data Science 1
Scientometrics Review 1
International Journal of Information Systems 1

4. Top Authors
Smith J. 1
Rahman A. 1
Chen L. 1
Karim M. 1
Lee K. 1
Hasan R. 1

5. Top Keywords or Index Terms
ETL 2
bibliometrics 1
artificial intelligence 1
research trends 1
OpenAlex 1
bibliometrix 1
Python 1
bibliographic data 1
data transformation 1

6. Citation Summary
Total citations: 46
Average citations: 15.33
Maximum citations: 22

Validation Status: PASSED
The standardized CSV file can be used for bibliometric-style analysis.

======================================================================
Analysis Validation for: Advanced Level - OpenAlex API
======================================================================

1. Total Records
Total records: 50

2. Publications by Year
PY
2005 1
2007 1
2009 2
2010 1
2014 1
2015 8
2016 2
2017 5
2018 6
2019 7
2020 7
2021 3
2022 3
2023 2
2024 1

3. Top Source Titles
SO
Journal of Business Research 8
Scientometrics 4
Sustainability 3
International Journal of Production Economics 2
El Profesional de la Informacion 1
International Journal of Consumer Studies 1
Encyclopedia 1
Journal of the Association for Information Science and Technology 1
Omega 1
Journal of Scientometric Research 1

4. Top Authors
Satish Kumar 7
José M. Merigó 4
Naveen Donthu 3
Nitesh Pandey 3
Weng Marc Lim 3
Enrique Herrera‐Viedma 3
Manuel J. Cobo 2
Domingo Ribeiro Soriano 2
Debidutta Pattnaik 2
Debmalya Mukherjee 1

5. Top Keywords or Index Terms
Computer science 43
Bibliometrics 35
Political science 30
Library science 29
Data science 26
Sociology 21
Citation 18
MEDLINE 17
Field (mathematics) 15
Engineering 13

6. Citation Summary
Total citations: 71312
Average citations: 1426.24
Maximum citations: 19703

Validation Status: PASSED
The standardized CSV file can be used for bibliometric-style analysis.
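
For reference, the base-level figures in this report can be reproduced from the three rows of `data_raw/scopus_sample.csv` with a few counters. The sketch below is an assumption about how such a summary could be computed, not the project's own analysis code; the `summarize` helper is hypothetical, and the records are typed in by hand from the sample file using the `PY`, `AU`, `TC`, and `DE` tags.

```python
from collections import Counter

# Values copied from data_raw/scopus_sample.csv, already split into lists.
records = [
    {"PY": 2024, "AU": ["Smith J.", "Rahman A."], "TC": 15,
     "DE": ["bibliometrics", "artificial intelligence", "research trends"]},
    {"PY": 2023, "AU": ["Chen L.", "Karim M."], "TC": 9,
     "DE": ["OpenAlex", "ETL", "bibliometrix", "Python"]},
    {"PY": 2022, "AU": ["Lee K.", "Hasan R."], "TC": 22,
     "DE": ["ETL", "bibliographic data", "data transformation"]},
]


def summarize(records, top_n=10):
    """Compute the counts and citation statistics shown in the report."""
    by_year = Counter(rec["PY"] for rec in records)
    authors = Counter(name for rec in records for name in rec["AU"])
    keywords = Counter(word for rec in records for word in rec["DE"])
    citations = [rec["TC"] for rec in records]
    average = round(sum(citations) / len(citations), 2) if citations else 0.0
    return {
        "total_records": len(records),
        "by_year": dict(sorted(by_year.items())),
        "top_authors": authors.most_common(top_n),
        "top_keywords": keywords.most_common(top_n),
        "total_citations": sum(citations),
        "average_citations": average,
        "max_citations": max(citations, default=0),
    }


summary = summarize(records)
print(summary["by_year"])
print(summary["top_keywords"][0])
print(summary["total_citations"], summary["average_citations"],
      summary["max_citations"])
```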
@@ -0,0 +1,6 @@
DB,UT,DI,PMID,TI,SO,JI,PY,DT,LA,TC,AU,AF,C1,RP,CR,DE,ID,AB,VL,IS,BP,EP,SR
OPENALEX,W3160856016,10.1016/j.jbusres.2021.04.070,,How to conduct a bibliometric analysis: An overview and guidelines,Journal of Business Research,,2021,article,en,12082,Naveen Donthu; Satish Kumar; Debmalya Mukherjee; Nitesh Pandey; Weng Marc Lim,Naveen Donthu; Satish Kumar; Debmalya Mukherjee; Nitesh Pandey; Weng Marc Lim,Georgia State University; Malaviya National Institute of Technology Jaipur; Swinburne University of Technology Sarawak Campus; University of Akron; Malaviya National Institute of Technology Jaipur; Swinburne University of Technology; Swinburne University of Technology Sarawak Campus,,,Bibliometrics; Field (mathematics); Data science; Resource (disambiguation); Computer science; Management science; Focus (optics); Library science; Engineering; Mathematics,Bibliometrics; Field (mathematics); Data science; Resource (disambiguation); Computer science; Management science; Focus (optics); Library science; Engineering; Mathematics; Computer network; Physics; Optics; Pure mathematics,,,,,,"Naveen Donthu, 2021, Journal of Business Research"
OPENALEX,W1021000864,10.1007/s11192-015-1645-z,,The bibliometric analysis of scholarly production: How great is the impact?,Scientometrics,,2015,article,en,2987,Ole Ellegaard; Johan Albert Wallin,Ole Ellegaard; Johan Albert Wallin,University of Southern Denmark; University of Southern Denmark,,,Bibliometrics; Production (economics); Knowledge production; Regional science; Citation analysis; Computer science; Sociology; Library science; Knowledge management; Economics; Citation,Bibliometrics; Production (economics); Knowledge production; Regional science; Citation analysis; Computer science; Sociology; Library science; Knowledge management; Economics; Citation; Macroeconomics,"Bibliometric methods or ""analysis"" are now firmly established as scientific specialties and are an integral part of research evaluation methodology especially within the scientific and applied fields. The methods are used increasingly when studying various aspects of science and also in the way institutions and universities are ranked worldwide. A sufficient number of studies have been completed, and with the resulting literature, it is now possible to analyse the bibliometric method by using its own methodology. The bibliometric literature in this study, which was extracted from Web of Science, is divided into two parts using a method comparable to the method of Jonkers et al. (Characteristics of bibliometrics articles in library and information sciences (LIS) and other journals, pp. 449-551, 2012: The publications either lie within the Information and Library Science (ILS) category or within the non-ILS category which includes more applied, ""subject"" based studies. The impact in the different groupings is judged by means of citation analysis using normalized data and an almost linear increase can be observed from 1994 onwards in the non-ILS category. The implication for the dissemination and use of the bibliometric methods in the different contexts is discussed. 
A keyword analysis identifies the most popular subjects covered by bibliometric analysis, and multidisciplinary articles are shown to have the highest impact. A noticeable shift is observed in those countries which contribute to the pool of bibliometric analysis, as well as a self-perpetuating effect in giving and taking references.",,,,,"Ole Ellegaard, 2015, Scientometrics"
OPENALEX,W1965746216,10.1016/j.ijpe.2015.01.003,,Green supply chain management: A review and bibliometric analysis,International Journal of Production Economics,,2015,review,en,2058,Behnam Fahimnia; Joseph Sarkis; Hoda Davarzani,Behnam Fahimnia; Joseph Sarkis; Hoda Davarzani,The University of Sydney; Worcester Polytechnic Institute; The University of Sydney,,,Field (mathematics); Supply chain management; Identification (biology); Bibliometrics; Computer science; Supply chain; Management science; Data science; Systematic review; Data mining; Business; Engineering; Political science; MEDLINE; Marketing,Field (mathematics); Supply chain management; Identification (biology); Bibliometrics; Computer science; Supply chain; Management science; Data science; Systematic review; Data mining; Business; Engineering; Political science; MEDLINE; Marketing; Law; Mathematics; Biology; Pure mathematics; Botany,,,,,,"Behnam Fahimnia, 2015, International Journal of Production Economics"
OPENALEX,W3001491100,10.3145/epi.2020.ene.03,,Software tools for conducting bibliometric analysis in science: An up-to-date review,El Profesional de la Informacion,,2020,article,es,1587,José A. Moral-Muñoz; Enrique Herrera‐Viedma; Antonio Santisteban‐Espejo; Manuel J. Cobo,José A. Moral-Muñoz; Enrique Herrera‐Viedma; Antonio Santisteban‐Espejo; Manuel J. Cobo,Universidad de Cádiz; Universidad de Granada; Hospital Universitario Puerta del Mar; Universidad de Cádiz,,,Bibliometrics; Visualization; Data science; Computer science; Data visualization; Set (abstract data type); Scientometrics; Database; Data mining; World Wide Web,Bibliometrics; Visualization; Data science; Computer science; Data visualization; Set (abstract data type); Scientometrics; Database; Data mining; World Wide Web; Programming language,"Bibliometrics has become an essential tool for assessing and analyzing the output of scientists, cooperation between universities, the effect of state-owned science funding on national research and development performance and educational efficiency, among other applications. Therefore, professionals and scientists need a range of theoretical and practical tools to measure experimental data. This review aims to provide an up-to-date review of the various tools available for conducting bibliometric and scientometric analyses, including the sources of data acquisition, performance analysis and visualization tools. The included tools were divided into three categories: general bibliometric and performance analysis, science mapping analysis, and libraries; a description of all of them is provided. A comparative analysis of the database sources support, pre-processing capabilities, analysis and visualization options were also provided in order to facilitate its understanding. Although there are numerous bibliometric databases to obtain data for bibliometric and scientometric analysis, they have been developed for a different purpose. 
The number of exportable records is between 500 and 50,000 and the coverage of the different science fields is unequal in each database. Concerning the analyzed tools, Bibliometrix contains the more extensive set of techniques and suitable for practitioners through Biblioshiny. VOSviewer has a fantastic visualization and is capable of loading and exporting information from many sources. SciMAT is the tool with a powerful pre-processing and export capability. In views of the variability of features, the users need to decide the desired analysis output and chose the option that better fits into their aims.",,,,,"José A. Moral-Muñoz, 2020, El Profesional de la Informacion"
OPENALEX,W3044902155,10.1111/ijcs.12605,,Financial literacy: A systematic review and bibliometric analysis,International Journal of Consumer Studies,,2020,review,en,1071,Kirti Goyal; Satish Kumar,Kirti Goyal; Satish Kumar,Malaviya National Institute of Technology Jaipur; Malaviya National Institute of Technology Jaipur,,,Financial literacy; Citation; Content analysis; Citation analysis; Bibliometrics; Literacy; Financial analysis; Accounting; Political science; Sociology; Social science; Business; Library science; Finance; Computer science; Pedagogy,Financial literacy; Citation; Content analysis; Citation analysis; Bibliometrics; Literacy; Financial analysis; Accounting; Political science; Sociology; Social science; Business; Library science; Finance; Computer science; Pedagogy,"Abstract Given the paucity of comprehensive summaries in the extant literature, this systematic review, coupled with bibliometric analysis, endeavours to take a meticulous approach intended at presenting quantitative and qualitative knowledge on the ever‐emerging subject of financial literacy. The study comprises a review of 502 articles ‐ published in peer‐reviewed journals from 2000 to 2019. Citation network, page‐rank analysis, co‐citation analysis, content analysis and publication trends have been employed to identify influential work, delineate the intellectual structure of the field and identify gaps. The most prominent journals, authors, countries, articles and themes have been identified using bibliometric analysis, followed by a comprehensive analysis of the content of 107 papers in the identified clusters. The three major themes enumerated are—levels of financial literacy amongst distinct cohorts, the influence that financial literacy exerts on financial planning and behaviour, and the impact of financial education. Additionally, content analysis of 175 papers has been conducted for the last four years’ articles that were not covered in the co‐citation analysis. 
Emerging themes identified include financial capability, financial inclusion, gender gap, tax & insurance literacy, and digital financial education. A conceptual framework has been modelled portraying the complete picture, following which potential areas of research have been suggested. This study will help policy‐makers, regulators and academic researchers know the nuts and bolts of financial literacy, and identify the relevant areas that need investigation.",,,,,"Kirti Goyal, 2020, International Journal of Consumer Studies"