Pandas Medicine Analysis

Exploratory data analysis of the Bangladesh medicine dataset using Pandas, Matplotlib, Plotly, and WordCloud.

Python, Pandas, Plotly, Matplotlib, WordCloud, BeautifulSoup, Jupyter Notebook

March 2026

Overview

A data analysis project exploring the Assorted Medicine Dataset of Bangladesh, containing 21,000+ medicines. The goal is to demonstrate core data analyst skills — data cleaning, transformation, descriptive statistics, and visualization — using real-world pharmaceutical data.

Features

  • Automatic dataset download from Kaggle with skip-if-exists logic
  • Data cleaning: dropping unnecessary columns, handling missing values, parsing HTML descriptions with BeautifulSoup
  • Price extraction using regex to calculate unit prices from complex package size strings
  • Interactive visualizations: pie charts, bar charts, and histograms with Plotly
  • Word cloud generation for drug classes and medical indications
  • Missing value analysis with percentage breakdowns and bar chart visualization
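
As an illustration of the price extraction step, here is a minimal sketch of how a unit price might be derived with regular expressions. The sample string formats, the `unit_price` helper, and both patterns are assumptions made for this example; the notebook's actual expressions and the dataset's real package size strings may differ.

```python
import re

# Assumed format: a Taka amount such as "৳ 400.00" or "৳ 1,200.00".
PRICE_RE = re.compile(r"৳\s*(\d[\d,]*(?:\.\d+)?)")
# Assumed format: a pack description such as "10 x 10" (strips x units per strip).
PACK_RE = re.compile(r"(\d+)\s*[xX]\s*(\d+)")


def unit_price(text):
    """Return the price per unit, or None if no usable price is found."""
    price_match = PRICE_RE.search(text)
    if price_match is None:
        return None
    total = float(price_match.group(1).replace(",", ""))

    # Without a recognizable pack description, treat the price as per unit.
    pack_match = PACK_RE.search(text)
    units = int(pack_match.group(1)) * int(pack_match.group(2)) if pack_match else 1
    if units == 0:
        return None
    return total / units


# Hypothetical inputs, not taken from the dataset.
print(unit_price("10 x 10: ৳ 400.00"))
print(unit_price("Unit Price: ৳ 12.50"))
```

Returning `None` for unparseable strings keeps those rows visible as missing values instead of silently turning them into zeros.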

Architecture

  • Single Jupyter Notebook — all analysis in main.ipynb, structured into sections: Medicine, Generic, Manufacturer, Dosage Form, Drug Class, and Indication
  • Pandas for all data loading, cleaning, and transformation — value counts, explode, groupby, string extraction
  • Plotly for interactive charts (pie, bar, histogram)
  • Matplotlib + WordCloud for word cloud visualizations
  • BeautifulSoup for parsing HTML content in generic description columns
  • Dataset downloaded automatically via the Kaggle API on first run
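
The Pandas pattern for multi-value columns mentioned above (split, explode, then value counts) can be sketched as follows. The rows and the column names `generic` and `dosage_form` are hypothetical stand-ins, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical rows and column names; the real dataset's schema may differ.
df = pd.DataFrame({
    "generic": ["Paracetamol", "Omeprazole", "Cetirizine"],
    "dosage_form": ["Tablet, Syrup", "Capsule", "Tablet, Syrup"],
})

# Split the comma-separated string into a list, then give each value its own row.
forms = (
    df.assign(dosage_form=df["dosage_form"].str.split(","))
      .explode("dosage_form")
)
# Remove the whitespace left over from splitting on "," alone.
forms["dosage_form"] = forms["dosage_form"].str.strip()

# Count how often each individual dosage form occurs.
form_counts = forms["dosage_form"].value_counts()
print(form_counts)
```

Exploding first means each dosage form is counted on its own, so a row listing "Tablet, Syrup" contributes to both categories rather than forming a separate combined one.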

Learnings

  • Practiced handling messy real-world data: multi-value columns, embedded HTML, currency symbols (৳ Bangladeshi Taka), and inconsistent price formats
  • Deepened understanding of Pandas — value counts, groupby, explode, string methods, and chaining operations for data transformation
  • Gained experience with Matplotlib for static visualizations and word cloud rendering
  • Gained experience building interactive Plotly charts for data exploration
  • Understood the importance of missing value analysis before drawing conclusions from incomplete columns
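
A percentage breakdown of missing values, as described in the last point, might look like the sketch below. The DataFrame contents and column names are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical data; column names are assumptions for this example.
df = pd.DataFrame({
    "brand_name": ["A", "B", "C", "D"],
    "price": [10.0, None, 12.5, None],
    "indication": [None, "Fever", "Allergy", "Pain"],
})

# isna() gives a boolean frame; the column mean is the fraction missing.
missing = (
    df.isna()
      .mean()
      .mul(100)
      .round(1)
      .sort_values(ascending=False)
      .rename("missing_pct")
)
print(missing)
```

The resulting Series is sorted from most to least incomplete, which makes it a convenient input for a bar chart and a quick check on which columns are too sparse to support conclusions.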