Marketing-Group

The analysis focused on segmenting a specific group of mall customers using the KMeans unsupervised machine learning algorithm. Univariate, bivariate, and multivariate clusters were identified and analyzed using summary statistics to gain insights into customer behavior and determine the most valuable segment for targeted marketing strategies.

The objective of this problem is to identify the most important shopping groups by analyzing customer characteristics such as income level, age, and mall shopping score. By examining patterns and similarities within these variables, the task aims to determine the ideal number of distinct customer groups that best represent the underlying structure of the data. Each customer is then assigned a clear group label, enabling better understanding of shopping behaviors and supporting targeted marketing strategies, personalized services, and data-driven business decisions.

Approach

Perform EDA
Use K-means clustering algorithm to create segments
Summarize Statistics on the clusters

Exploratory Data Analysis (Univariate & Bivariate)

Univariate Analysis

Focusing on one variable at a time allows us to find patterns within that single variable.

Both the "Age" and "Annual Income (k$)" columns show a positive (right) skew, with most of the data concentrated in the lower-to-middle part of each range.

A roughly symmetrical distribution can be observed in the "Spending Score (1-100)" column, where the data is fairly balanced around the center (a spending score of about 50), with similar frequencies on the lower and higher sides.
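The visual impression of skew can also be checked numerically. The following is a minimal sketch, assuming pandas is available; it uses synthetic stand-in data rather than the values in Mall_Customers.csv, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the Mall_Customers.csv values), for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    # Gamma-distributed values have a long right tail (positive skew).
    "Annual Income (k$)": rng.gamma(shape=2.0, scale=30.0, size=500),
    # Normally distributed values are roughly symmetric around 50.
    "Spending Score (1-100)": rng.normal(loc=50, scale=15, size=500).clip(1, 100),
})

# Sample skewness per column: > 0 means a longer right tail, near 0 means symmetry.
skewness = df.skew()
print(skewness)
```

On the real data, the same `df.skew()` call could be applied to the "Age", "Annual Income (k$)", and "Spending Score (1-100)" columns.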

To gather more information, the data can be split by a dimension such as gender, which allows a more comprehensive analysis and helps identify frequencies and outliers.

From the frequency distribution of annual income by gender, the following can be observed: Both male and female income distributions peak in the mid-income range (roughly $40k–$80k), indicating that most customers fall within this bracket.

The female distribution is more concentrated around the center, suggesting less variability and a higher frequency of middle-income earners.
The male distribution has a longer right tail, indicating greater variability and the presence of higher-income outliers, which causes a slight right skew.

From the frequency distribution of spending score by gender, the following can be observed: The two curves overlap substantially, meaning many males and females have similar spending scores.

Females show a higher peak around 45–55, suggesting a larger proportion of female customers cluster in the mid-to-high spending range.
Males have a broader, flatter distribution, indicating more variability in spending behavior.
Males appear slightly more represented at lower spending scores (around 0–25) compared to females.
Both genders extend into the high spending range (80–100), but females show a slightly stronger presence around 70–80.

From the frequency distribution of age by gender, the following can be observed: Both males and females are most concentrated between 20 and 45 years, indicating this is the dominant customer age range.

Female customers: Peak density is around 30–35 years, showing a strong concentration in early adulthood.
Male customers: The peak is slightly younger (mid-to-late 20s) and the curve is broader, indicating greater age diversity.
Males extend more into older age groups (50–70+) than females.
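The per-gender observations above can be complemented with summary statistics. This is a sketch under stated assumptions: the column names follow the dataset, but the rows are randomly generated placeholders, so the output does not reproduce the patterns described.

```python
import numpy as np
import pandas as pd

# Randomly generated placeholder rows (NOT the real customers).
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(18, 71, size=n),
    "Annual Income (k$)": rng.integers(15, 138, size=n),
    "Spending Score (1-100)": rng.integers(1, 101, size=n),
})

numeric_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]

# Mean, median, and standard deviation of each numeric column, split by gender.
by_gender = df.groupby("Gender")[numeric_cols].agg(["mean", "median", "std"])
print(by_gender)
```

A larger standard deviation for one gender would correspond to the broader, flatter curves noted in the plots.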

Bivariate Analysis

A scatter plot is employed to assess the relationship between "Spending Score" and "Annual Income." Preliminary visual inspection, without the application of advanced analytical techniques, indicates the presence of distinct groupings within the data. Specifically, the observations suggest the existence of approximately five clusters, as well as a limited number of outliers.

To efficiently examine multivariate relationships, marginal distributions, and structural patterns within the dataset, a pair plot is employed, with gender used as a categorical differentiating variable. This visualization facilitates the preliminary identification of potential cluster structures in the data prior to formal clustering analysis.

As a last step, a heatmap can be generated to observe the correlations between the numeric variables.
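A heatmap is a rendering of a correlation matrix, which can be computed directly with pandas. The sketch below assumes placeholder data in place of the real file and only computes the matrix; a plotting library such as seaborn could then draw it.

```python
import numpy as np
import pandas as pd

# Placeholder data (NOT the real dataset), used only to make the sketch runnable.
rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(18, 71, size=n),
    "Annual Income (k$)": rng.integers(15, 138, size=n),
    "Spending Score (1-100)": rng.integers(1, 101, size=n),
})

# Pairwise Pearson correlation between the numeric columns.
corr = df.corr()
print(corr.round(2))

# In a notebook, the matrix could be drawn with, for example:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True)
```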

K-Means Algorithm Clustering (Univariate, Bivariate & Multivariate)

Univariate Clustering

In the next step, the Annual Income data is fitted to a K-means clustering algorithm, which partitions the dataset into distinct groups. The resulting cluster labels are then compared with the original data by assigning them as a new feature:

df['Income Cluster'] = clustering1.labels_

This approach enables the computation of summary statistics for each identified cluster.
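The fitting step that produces `clustering1` is not shown in this README, so the following is only a sketch of what it might look like. It assumes scikit-learn's `KMeans` with three clusters; `n_init`, `random_state`, and the synthetic income values are assumptions added for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic income values (NOT the real dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({"Annual Income (k$)": rng.gamma(2.0, 30.0, size=200)})

# n_init and random_state are assumptions, not settings taken from the notebook.
clustering1 = KMeans(n_clusters=3, n_init=10, random_state=0)
clustering1.fit(df[["Annual Income (k$)"]])

# Attach each customer's cluster label as a new column.
df["Income Cluster"] = clustering1.labels_

# One centroid per cluster, expressed in the units of the income column.
print(clustering1.cluster_centers_.ravel())
```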

(This process is iterative and is adjusted on each pass to find the optimal number of clusters.) It is now possible to compute summary statistics for the univariate clusters.

We can check how many customers fall into each cluster.

df['Income Cluster'].value_counts()

We can see that cluster 2 contains the most customers and cluster 1 contains the fewest.

clustering1.inertia_

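This is the inertia: the sum of squared distances from each sample to the centroid of its assigned cluster, so lower values mean tighter clusters. (Models with 1 to 10 clusters were fitted to find the ideal number of clusters.) (The table here reflects the optimal number of clusters; for the whole process, check the Colab notebook.)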
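To make the meaning of `inertia_` concrete, the sketch below recomputes it by hand and compares it with the value scikit-learn reports. The data is synthetic, and `n_init` and `random_state` are assumptions added for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic one-feature data (NOT the real income column).
rng = np.random.default_rng(0)
X = rng.normal(loc=60, scale=20, size=(150, 1))

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up, for every sample, the centroid of the cluster it was assigned to.
assigned_centroids = model.cluster_centers_[model.labels_]

# Inertia = sum of squared distances between samples and their centroids.
manual_inertia = np.sum((X - assigned_centroids) ** 2)

print(manual_inertia, model.inertia_)
```

The two printed values should agree up to floating-point error.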

The next step is to fit K-means to "Annual Income" for each candidate number of clusters and append each resulting inertia to a list of inertia scores.

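from sklearn.cluster import KMeans

# Fit K-means for k = 1..10 and record the inertia of each fit
inertia_scores = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(df[['Annual Income (k$)']])
    inertia_scores.append(kmeans.inertia_)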

An elbow plot is generated to determine the optimal number of clusters. The elbow occurs between 2 and 4 clusters, suggesting that the optimal number of clusters is 3.
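An elbow plot of this kind can be drawn with matplotlib. This is a sketch under stated assumptions: the data is synthetic, `n_init` and `random_state` are added for reproducibility, and the axis labels are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic income values (NOT the real dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({"Annual Income (k$)": rng.gamma(2.0, 30.0, size=200)})

# Fit one model per candidate k and collect the inertia of each.
ks = list(range(1, 11))
inertia_scores = []
for k in ks:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    kmeans.fit(df[["Annual Income (k$)"]])
    inertia_scores.append(kmeans.inertia_)

# Plot inertia against k; the "elbow" is where the curve stops dropping sharply.
fig, ax = plt.subplots()
ax.plot(ks, inertia_scores, marker="o")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("Inertia")
ax.set_title("Elbow plot (synthetic data)")
```

Reading the elbow remains a visual judgment, which is why the text reports a range of candidate values before settling on one.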

(After analysing the iterative process to find the optimal number of clusters, it is possible to produce the following table.)

By grouping "Age," "Annual Income," and "Spending Score" by "Income Cluster" and taking the mean, it is possible to observe that customers in cluster 0 have the highest annual income, those in cluster 2 have a middle income, and those in cluster 1 have the lowest annual income.
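The aggregation described above corresponds to a pandas `groupby` followed by `mean`. The sketch below runs on randomly generated placeholder rows, so its cluster numbering and means will not match the table in this README.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Randomly generated placeholder rows (NOT the real customers).
rng = np.random.default_rng(4)
n = 200
df = pd.DataFrame({
    "Age": rng.integers(18, 71, size=n),
    "Annual Income (k$)": rng.integers(15, 138, size=n),
    "Spending Score (1-100)": rng.integers(1, 101, size=n),
})

# n_init and random_state are assumptions added for reproducibility.
clustering1 = KMeans(n_clusters=3, n_init=10, random_state=0)
df["Income Cluster"] = clustering1.fit_predict(df[["Annual Income (k$)"]])

# Mean of each feature within every income cluster.
cluster_means = df.groupby("Income Cluster")[
    ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
].mean()
print(cluster_means)
```

Cluster labels from K-means are arbitrary integers, so which label holds the highest income can change between runs unless the random seed is fixed.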

Bivariate Clustering

(The same method can be used in this section to find the ideal number of clusters.)

K-means is fitted to the "Annual Income" and "Spending Score" columns to transform raw, unlabeled data into actionable, categorized information. This produces the following table:

This table can be optimized using the optimal number of clusters (The same method used in the univariate clustering can be used here).

# Fit K-means on income and spending score for k = 1..10 and record each inertia
inertia_scores2 = []
for i in range(1, 11):
    kmeans2 = KMeans(n_clusters=i)
    kmeans2.fit(df[['Annual Income (k$)', 'Spending Score (1-100)']])
    inertia_scores2.append(kmeans2.inertia_)

With this optimization it is possible to generate an elbow plot and create a new scatter plot to gather relevant information about spending trends across the customer groups.

After analysing the elbow plot, it is possible to observe that the elbow occurs between 4 and 6 clusters, suggesting that the optimal number of clusters is 5.

With this new information, the next step is a visual analysis with the chosen number of clusters. A scatter plot allows the visualization of relationships, correlations, and patterns between the variables.

Using "Annual Income" as the x-axis and "Spending Score" as the y-axis, it is possible to determine the following:

Cluster 0 contains high income and high spending.
- Premium customers could be located in this demographic.
Cluster 1 contains low income and high spending.
- Enthusiastic spenders despite lower income.
Cluster 2 contains high income and low spending.
- Financially capable but not engaged. This is untapped potential.
Cluster 3 contains mid income and mid spending.
- Reliable customers.
Cluster 4 contains very high income and mid spending (closer to low spending).
- High earners with inconsistent behavior.
Clusters 0 and 1 are the strongest revenue drivers.
Cluster 2 is the biggest growth opportunity.
Cluster 3 contains the demographic whose spending score is in line with its income level.
Cluster 4 may be worth deeper analysis.

It is possible to check the gender split and average age of each cluster for deeper analysis.
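One way to run that check is a normalized cross-tabulation for gender and a grouped mean for age. This is a sketch on placeholder data; the model name `clustering2` and the `n_init` and `random_state` settings are hypothetical, while the column name "Spending and Income Cluster" follows the one used later in this README.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Randomly generated placeholder rows (NOT the real customers).
rng = np.random.default_rng(3)
n = 250
df = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(18, 71, size=n),
    "Annual Income (k$)": rng.integers(15, 138, size=n),
    "Spending Score (1-100)": rng.integers(1, 101, size=n),
})

# `clustering2` is a hypothetical name for the bivariate model.
clustering2 = KMeans(n_clusters=5, n_init=10, random_state=0)
clustering2.fit(df[["Annual Income (k$)", "Spending Score (1-100)"]])
df["Spending and Income Cluster"] = clustering2.labels_

# Share of each gender within every cluster (each row sums to 1).
gender_share = pd.crosstab(
    df["Spending and Income Cluster"], df["Gender"], normalize="index"
)

# Average age per cluster.
mean_age = df.groupby("Spending and Income Cluster")["Age"].mean()

print(gender_share.round(2))
print(mean_age.round(1))
```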

Multivariate Analysis

The first step in multivariate analysis is to scale the data so that features with larger numerical ranges do not disproportionately dominate the distance calculations used by the algorithm.

To scale the data correctly, the following steps are followed:

Apply one-hot encoding to the "Gender" column to replace the male and female labels with 0 and 1.
Drop the columns that are not needed:
- "Gender" (strings)
- "Income Cluster"
- "Spending and Income Cluster"

Now it is possible to scale the resulting table into a new dataframe to generate the correct analysis.
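The encoding, dropping, and scaling steps could look like the sketch below. It assumes `pd.get_dummies` for the one-hot encoding and scikit-learn's `StandardScaler` for the scaling, neither of which is confirmed by this README; it also assumes that `dff` refers to the scaled table. The rows and cluster labels are random placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Random placeholder rows and labels (NOT the real customers or clusters).
rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(18, 71, size=n),
    "Annual Income (k$)": rng.integers(15, 138, size=n),
    "Spending Score (1-100)": rng.integers(1, 101, size=n),
    "Income Cluster": rng.integers(0, 3, size=n),
    "Spending and Income Cluster": rng.integers(0, 5, size=n),
})

# One-hot encode "Gender": drop_first keeps a single 0/1 column (Gender_Male).
encoded = pd.get_dummies(df, columns=["Gender"], drop_first=True, dtype=int)

# Remove the cluster labels from earlier steps so they are not used as features.
features = encoded.drop(columns=["Income Cluster", "Spending and Income Cluster"])

# Standardize every feature to mean 0 and standard deviation 1.
scaler = StandardScaler()
dff = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
print(dff.describe().round(2))
```

After scaling, a difference of one unit means one standard deviation in every column, so income in thousands of dollars no longer outweighs the 0/1 gender indicator.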

# Fit K-means on the scaled dataframe for k = 1..10 and record each inertia
inertia_scores3 = []
for i in range(1, 11):
    kmeans3 = KMeans(n_clusters=i)
    kmeans3.fit(dff)
    inertia_scores3.append(kmeans3.inertia_)

After analysing the elbow plot, it is possible to observe that the elbow occurs between 3 and 6 clusters; from 4 clusters onward the curve starts to flatten with each additional cluster.

(More analysis can be done, but it is not needed to obtain the relevant information.)

Summarize Statistics on the clusters

The target group would be cluster 0, which has a high "Spending Score".
54% of cluster 0 shoppers are women. The marketing team should look for ways to attract these customers with marketing campaigns targeting items popular in this demographic.
Cluster 1 presents an opportunity to market sales events on popular items to these customers.

