Pandas Medicine Analysis
Exploratory data analysis of the Bangladesh medicine dataset using Pandas, Matplotlib, Plotly, and WordCloud.
Python, Pandas, Plotly, Matplotlib, WordCloud, BeautifulSoup, Jupyter Notebook
March 2026
Overview
A data analysis project exploring the Assorted Medicine Dataset of Bangladesh, containing 21,000+ medicines. The goal is to demonstrate core data analyst skills — data cleaning, transformation, descriptive statistics, and visualization — using real-world pharmaceutical data.
Features
- Automatic dataset download from Kaggle with skip-if-exists logic
- Data cleaning: dropping unnecessary columns, handling missing values, parsing HTML descriptions with BeautifulSoup
- Price extraction using regex to calculate unit prices from complex package size strings
- Interactive visualizations: pie charts, bar charts, and histograms with Plotly
- Word cloud generation for drug classes and medical indications
- Missing value analysis with percentage breakdowns and bar chart visualization
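The price-extraction step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the package string layout ("3 x 10: ৳ 240.00"), the regex patterns, and the function name `unit_price` are all assumptions made for the example.

```python
import re
from typing import Optional

# Assumed pack layout: "<outer> x <inner>: ৳ <total>", e.g. "3 x 10: ৳ 240.00".
PACK_RE = re.compile(r"(\d+)\s*x\s*(\d+)\s*:\s*৳\s*([\d,]+(?:\.\d+)?)")
# Fallback: the first Taka amount anywhere in the string.
AMOUNT_RE = re.compile(r"৳\s*([\d,]+(?:\.\d+)?)")


def unit_price(package: object) -> Optional[float]:
    """Return a per-unit price in Taka, or None when no price can be parsed."""
    if not isinstance(package, str):
        return None  # missing values (None, NaN) carry no price
    pack = PACK_RE.search(package)
    if pack:
        outer, inner, total = pack.groups()
        units = int(outer) * int(inner)
        if units > 0:
            # Divide the pack total by the number of units in the pack.
            return float(total.replace(",", "")) / units
    amount = AMOUNT_RE.search(package)
    if amount:
        # No usable pack breakdown: treat the first amount as the unit price.
        return float(amount.group(1).replace(",", ""))
    return None
```

On a DataFrame, a function like this could be applied to the package column with `Series.apply`, leaving unparseable rows as missing values for the later missing-value analysis.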
Architecture
- Single Jupyter Notebook: all analysis lives in main.ipynb, organized into six sections (Medicine, Generic, Manufacturer, Dosage Form, Drug Class, and Indication)
- Pandas for all data loading, cleaning, and transformation: value counts, explode, groupby, string extraction
- Plotly for interactive charts (pie, bar, histogram)
- Matplotlib + WordCloud for word cloud visualizations
- BeautifulSoup for parsing HTML content in generic description columns
- Dataset auto-downloaded via the Kaggle API on first run
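The download-on-first-run behaviour might look something like the sketch below. The function name `ensure_dataset` and the `download` callback are hypothetical; the callback stands in for whatever Kaggle client call the notebook actually uses, which is not shown here.

```python
from pathlib import Path
from typing import Callable


def ensure_dataset(csv_path: Path, download: Callable[[Path], None]) -> bool:
    """Fetch the dataset only when csv_path is missing or empty.

    `download` is a hypothetical callback that receives the target directory
    and is expected to place the CSV there (for example by wrapping a Kaggle
    client). Returns True if a download was triggered, False if it was skipped.
    """
    if csv_path.is_file() and csv_path.stat().st_size > 0:
        return False  # already on disk: skip the network call
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    download(csv_path.parent)
    if not csv_path.is_file():
        raise FileNotFoundError(f"download did not produce {csv_path}")
    return True
```

Passing the downloader in as a parameter keeps the skip logic independent of any particular client library and makes it easy to exercise without network access.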
Learnings
- Practiced handling messy real-world data: multi-value columns, embedded HTML, currency symbols (৳ Bangladeshi Taka), and inconsistent price formats
- Deepened understanding of Pandas — value counts, groupby, explode, string methods, and chaining operations for data transformation
- Gained experience with Matplotlib for static visualizations and word cloud rendering
- Gained experience building interactive Plotly charts for data exploration
- Understood the importance of missing value analysis before drawing conclusions from incomplete columns
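Two of the Pandas techniques mentioned above, missing-value percentages and exploding multi-value columns, can be illustrated on toy data. The column names, values, comma separator, and helper names here are invented for the example and are not taken from the real dataset or notebook.

```python
import pandas as pd

# Toy data: column names and values are made up for illustration.
df = pd.DataFrame({
    "brand": ["A", "B", "C", "D"],
    "drug_class": ["Antibiotic, Analgesic", "Analgesic", None, "Antibiotic"],
    "indication": ["Fever", None, None, "Infection"],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("percent", ascending=False)


def multi_value_counts(frame: pd.DataFrame, column: str, sep: str = ",") -> pd.Series:
    """Split a delimited column into one row per value and count occurrences."""
    values = frame[column].dropna().str.split(sep).explode().str.strip()
    return values[values != ""].value_counts()


summary = missing_summary(df)
class_counts = multi_value_counts(df, "drug_class")
```

Checking `summary` first shows how many rows actually feed a count such as `class_counts`, which is the reason to run the missing-value analysis before interpreting per-column results.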