Brazilian Ecommerce Dataset

Importing the data into Python variables

Import applicable libraries

Code
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Read the CSV into Pandas dataframes

Code
base_path = "C:/Users/desja/Dropbox/Backup of Jumper Laptop files continuous/Documents/Project Revamp Portfolio/Brazilian Ecommerce Dataset (extra files)"

df_customers = pd.read_csv(f"{base_path}/olist_customers_dataset.csv")
df_geolocation = pd.read_csv(f"{base_path}/olist_geolocation_dataset.csv")
df_order_items = pd.read_csv(f"{base_path}/olist_order_items_dataset.csv")
df_order_payments = pd.read_csv(f"{base_path}/olist_order_payments_dataset.csv")
df_order_reviews = pd.read_csv(f"{base_path}/olist_order_reviews_dataset.csv")
df_orders = pd.read_csv(f"{base_path}/olist_orders_dataset.csv")
df_products = pd.read_csv(f"{base_path}/olist_products_dataset.csv")
df_sellers = pd.read_csv(f"{base_path}/olist_sellers_dataset.csv")
df_product_category_name = pd.read_csv(f"{base_path}/product_category_name_translation.csv")

View the first few rows of the tables to get an idea of what is in them

Code
df_customers.head()
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP
1 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 9790 sao bernardo do campo SP
2 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 1151 sao paulo SP
3 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 8775 mogi das cruzes SP
4 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP
Code
df_geolocation.head()
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
0 1037 -23.545621 -46.639292 sao paulo SP
1 1046 -23.546081 -46.644820 sao paulo SP
2 1046 -23.546129 -46.642951 sao paulo SP
3 1041 -23.544392 -46.639499 sao paulo SP
4 1035 -23.541578 -46.641607 sao paulo SP
Code
df_order_items.head()
order_id order_item_id product_id seller_id shipping_limit_date price freight_value
0 00010242fe8c5a6d1ba2dd792cb16214 1 4244733e06e7ecb4970a6e2683c13e61 48436dade18ac8b2bce089ec2a041202 2017-09-19 09:45:35 58.90 13.29
1 00018f77f2f0320c557190d7a144bdd3 1 e5f2d52b802189ee658865ca93d83a8f dd7ddc04e1b6c2c614352b383efe2d36 2017-05-03 11:05:13 239.90 19.93
2 000229ec398224ef6ca0657da4fc703e 1 c777355d18b72b67abbeef9df44fd0fd 5b51032eddd242adc84c38acab88f23d 2018-01-18 14:48:30 199.00 17.87
3 00024acbcdf0a6daa1e931b038114c75 1 7634da152a4610f1595efa32f14722fc 9d7a1d34a5052409006425275ba1c2b4 2018-08-15 10:10:18 12.99 12.79
4 00042b26cf59d7ce69dfabb4e55b4fd9 1 ac6c3623068f30de03045865e4e10089 df560393f3a51e74553ab94004ba5c87 2017-02-13 13:57:51 199.90 18.14
Code
df_order_payments.head()
order_id payment_sequential payment_type payment_installments payment_value
0 b81ef226f3fe1789b1e8b2acac839d17 1 credit_card 8 99.33
1 a9810da82917af2d9aefd1278f1dcfa0 1 credit_card 1 24.39
2 25e8ea4e93396b6fa0d3dd708e76c1bd 1 credit_card 1 65.71
3 ba78997921bbcdc1373bb41e913ab953 1 credit_card 8 107.78
4 42fdf880ba16b47b59251dd489d4441a 1 credit_card 2 128.45
Code
df_order_reviews.head()
review_id order_id review_score review_comment_title review_comment_message review_creation_date review_answer_timestamp
0 7bc2406110b926393aa56f80a40eba40 73fc7af87114b39712e6da79b0a377eb 4 NaN NaN 2018-01-18 00:00:00 2018-01-18 21:46:59
1 80e641a11e56f04c1ad469d5645fdfde a548910a1c6147796b98fdf73dbeba33 5 NaN NaN 2018-03-10 00:00:00 2018-03-11 03:05:13
2 228ce5500dc1d8e020d8d1322874b6f0 f9e4b658b201a9f2ecdecbb34bed034b 5 NaN NaN 2018-02-17 00:00:00 2018-02-18 14:36:24
3 e64fb393e7b32834bb789ff8bb30750e 658677c97b385a9be170737859d3511b 5 NaN Recebi bem antes do prazo estipulado. 2017-04-21 00:00:00 2017-04-21 22:02:06
4 f7c4243c7fe1938f181bec41a392bdeb 8e6bfb81e283fa7e4f11123a3fb894f1 5 NaN Parabéns lojas lannister adorei comprar pela I... 2018-03-01 00:00:00 2018-03-02 10:26:53
Code
df_orders.head()
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
0 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2017-10-02 10:56:33 2017-10-02 11:07:15 2017-10-04 19:55:00 2017-10-10 21:25:13 2017-10-18 00:00:00
1 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef delivered 2018-07-24 20:41:37 2018-07-26 03:24:27 2018-07-26 14:31:00 2018-08-07 15:27:45 2018-08-13 00:00:00
2 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23 2018-08-08 13:50:00 2018-08-17 18:06:29 2018-09-04 00:00:00
3 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59 2017-11-22 13:39:59 2017-12-02 00:28:42 2017-12-15 00:00:00
4 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00
Code
df_products.head()
product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
0 1e9e8ef04dbcff4541ed26657ea517e5 perfumaria 40.0 287.0 1.0 225.0 16.0 10.0 14.0
1 3aa071139cb16b67ca9e5dea641aaa2f artes 44.0 276.0 1.0 1000.0 30.0 18.0 20.0
2 96bd76ec8810374ed1b65e291975717f esporte_lazer 46.0 250.0 1.0 154.0 18.0 9.0 15.0
3 cef67bcfe19066a932b7673e239eb23d bebes 27.0 261.0 1.0 371.0 26.0 4.0 26.0
4 9dc1a7de274444849c219cff195d0b71 utilidades_domesticas 37.0 402.0 4.0 625.0 20.0 17.0 13.0
Code
df_sellers.head()
seller_id seller_zip_code_prefix seller_city seller_state
0 3442f8959a84dea7ee197c632cb2df15 13023 campinas SP
1 d1b65fc7debc3361ea86b5f14c68d2e2 13844 mogi guacu SP
2 ce3ad9de960102d0677a81f5d0bb7b2d 20031 rio de janeiro RJ
3 c0f3eea2e14555b6faeea3dd58c1b1c3 4195 sao paulo SP
4 51a04a8a6bdcb23deccc82b0b80742cf 12914 braganca paulista SP
Code
df_product_category_name.head()
product_category_name product_category_name_english
0 beleza_saude health_beauty
1 informatica_acessorios computers_accessories
2 automotivo auto
3 cama_mesa_banho bed_bath_table
4 moveis_decoracao furniture_decor

Iterate through each dataframe and each of its columns to check for empty cells (NaN) or question-mark placeholders

Code
#First store all the dataframes in a dictionary

df_dict = {'df_customers': df_customers, 'df_geolocation': df_geolocation, 'df_order_items': df_order_items,
           'df_order_payments': df_order_payments, 'df_order_reviews': df_order_reviews, 'df_orders': df_orders,
           'df_products': df_products, 'df_sellers': df_sellers, 'df_product_category_name': df_product_category_name}


#Iterate through each dataframe in the dictionary

for name, data_frame in df_dict.items():

    #Create boolean masks for empty cells (NaN) and for cells containing a question mark
    empty_cells = data_frame.isnull()
    has_question_mark = (data_frame == '?')

    #For each column, print the counts of True values whenever any are present
    for column in empty_cells.columns:
        if empty_cells[column].any():
            print(f"Table: {name} | Column: {column}")
            print("Empty Cells")
            print(empty_cells[column].value_counts())
            print("")

    for column in has_question_mark.columns:
        if has_question_mark[column].any():
            print(f"Table: {name} | Column: {column}")
            print("Has Question Mark")
            print(has_question_mark[column].value_counts())
            print("")

Table: df_order_reviews | Column: review_comment_title
Empty Cells
review_comment_title
True     87656
False    11568
Name: count, dtype: int64

Table: df_order_reviews | Column: review_comment_message
Empty Cells
review_comment_message
True     58247
False    40977
Name: count, dtype: int64

Table: df_order_reviews | Column: review_comment_title
Has Question Mark
review_comment_title
False    99223
True         1
Name: count, dtype: int64

Table: df_order_reviews | Column: review_comment_message
Has Question Mark
review_comment_message
False    99221
True         3
Name: count, dtype: int64

Table: df_orders | Column: order_approved_at
Empty Cells
order_approved_at
False    99281
True       160
Name: count, dtype: int64

Table: df_orders | Column: order_delivered_carrier_date
Empty Cells
order_delivered_carrier_date
False    97658
True      1783
Name: count, dtype: int64

Table: df_orders | Column: order_delivered_customer_date
Empty Cells
order_delivered_customer_date
False    96476
True      2965
Name: count, dtype: int64

Table: df_products | Column: product_category_name
Empty Cells
product_category_name
False    32341
True       610
Name: count, dtype: int64

Table: df_products | Column: product_name_lenght
Empty Cells
product_name_lenght
False    32341
True       610
Name: count, dtype: int64

Table: df_products | Column: product_description_lenght
Empty Cells
product_description_lenght
False    32341
True       610
Name: count, dtype: int64

Table: df_products | Column: product_photos_qty
Empty Cells
product_photos_qty
False    32341
True       610
Name: count, dtype: int64

Table: df_products | Column: product_weight_g
Empty Cells
product_weight_g
False    32949
True         2
Name: count, dtype: int64

Table: df_products | Column: product_length_cm
Empty Cells
product_length_cm
False    32949
True         2
Name: count, dtype: int64

Table: df_products | Column: product_height_cm
Empty Cells
product_height_cm
False    32949
True         2
Name: count, dtype: int64

Table: df_products | Column: product_width_cm
Empty Cells
product_width_cm
False    32949
True         2
Name: count, dtype: int64
Rename misspelled column names in the products table
Code
df_products.rename(columns={"product_name_lenght": "product_name_length",
                            "product_description_lenght": "product_description_length"}, inplace=True)

Analyze NULL values in table and replace them strategically where appropriate

It is reasonable for the review comment title and message to be left blank or to contain only a question mark.
After reviewing the orders and order_reviews tables, I determined that most of the orders with these empty review fields were simply never delivered. The reviewers often expressed frustration, gave the company a poor rating, and stated that they never received the product (consistent with the empty order_delivered_customer_date cells). These are not data entry errors, so I will leave the cells blank.
In cases where the order_approved_at column was empty, it appears the products were out of stock, and this was not always clearly communicated to the shopper; at least one reviewer was under the impression that their order had gone through and was upset that they never received it.
Some replacement is warranted for orders that were marked delivered but have no approval timestamp. Below, I fill these NULL values using the average approval time (in hours) computed earlier in SQLite:
Code
#Ensure the order_purchase_timestamp column is indeed a timestamp 
df_orders['order_purchase_timestamp'] = pd.to_datetime(df_orders['order_purchase_timestamp'])

#Average approval time in hours (purchase to approval), computed earlier in SQLite
avg_time_to_approve = 10.4190943019277

#Condition: delivered orders that are missing an approval timestamp
condition = (df_orders['order_status'] == 'delivered') & (df_orders['order_approved_at'].isnull())

#Update entries that meet this condition with an "average"
df_orders.loc[condition, 'order_approved_at'] = (df_orders.loc[condition, 'order_purchase_timestamp'] + pd.to_timedelta(avg_time_to_approve, unit='h'))
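As a cross-check, the same average approval time could be recomputed directly in pandas rather than relying on the hardcoded SQLite figure. This is only a minimal sketch (ideally run before the fill above, so the filled values do not feed back into the average); the hardcoded value is what was actually used.
Code
#Sketch: average purchase-to-approval time in hours, computed in pandas
approved = df_orders.dropna(subset=['order_approved_at']).copy()
approved['order_approved_at'] = pd.to_datetime(approved['order_approved_at'])
avg_hours = (approved['order_approved_at'] - approved['order_purchase_timestamp']).mean() / pd.Timedelta(hours=1)
print(f"Average approval time: {avg_hours:.2f} hours")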
In the case of product_category_name and its NULL values, these appear to be miscellaneous products that did not fit under any of the main categories already listed. The best approach here is to create a new category called “Miscellaneous” for them:
Code
df_products['product_category_name'] = df_products['product_category_name'].fillna('Miscellaneous')
This ensures that no product_category_name values remain NULL. Since the remaining NULL/empty cells in the table belong to these specific uncategorized products, which simply have no corresponding data recorded, those fields are best left as NULL. For example, product_photos_qty is sometimes NULL for these uncategorized/miscellaneous products, yet some reviews indicate that photos were shown online, so filling in 0 would be inaccurate.
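A quick sanity check (a minimal sketch) confirms the fill and shows how many products ended up in the new category:
Code
#Sketch: confirm there are no remaining NULL category names and count the new Miscellaneous category
print(df_products['product_category_name'].isnull().sum())
print((df_products['product_category_name'] == 'Miscellaneous').sum())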

Exploratory analysis

Print some basic statistical summary data for each dataframe
Code
df_customers.describe(include = "all")
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
count 99441 99441 99441.000000 99441 99441
unique 99441 96096 NaN 4119 27
top 06b8999e2fba1a1fbc88172c00ba8bc7 8d50f5eadf50201ccdcedfb9e2ac8455 NaN sao paulo SP
freq 1 17 NaN 15540 41746
mean NaN NaN 35137.474583 NaN NaN
std NaN NaN 29797.938996 NaN NaN
min NaN NaN 1003.000000 NaN NaN
25% NaN NaN 11347.000000 NaN NaN
50% NaN NaN 24416.000000 NaN NaN
75% NaN NaN 58900.000000 NaN NaN
max NaN NaN 99990.000000 NaN NaN
Code
df_geolocation.describe(include = "all")
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
count 1.000163e+06 1.000163e+06 1.000163e+06 1000163 1000163
unique NaN NaN NaN 8011 27
top NaN NaN NaN sao paulo SP
freq NaN NaN NaN 135800 404268
mean 3.657417e+04 -2.117615e+01 -4.639054e+01 NaN NaN
std 3.054934e+04 5.715866e+00 4.269748e+00 NaN NaN
min 1.001000e+03 -3.660537e+01 -1.014668e+02 NaN NaN
25% 1.107500e+04 -2.360355e+01 -4.857317e+01 NaN NaN
50% 2.653000e+04 -2.291938e+01 -4.663788e+01 NaN NaN
75% 6.350400e+04 -1.997962e+01 -4.376771e+01 NaN NaN
max 9.999000e+04 4.506593e+01 1.211054e+02 NaN NaN
Code

df_order_items.describe(include = "all")
order_id order_item_id product_id seller_id shipping_limit_date price freight_value
count 112650 112650.000000 112650 112650 112650 112650.000000 112650.000000
unique 98666 NaN 32951 3095 93318 NaN NaN
top 8272b63d03f5f79c56e9e4120aec44ef NaN aca2eb7d00ea1a7b8ebd4e68314663af 6560211a19b47992c3666cc44a7e94c0 2017-07-21 18:25:23 NaN NaN
freq 21 NaN 527 2033 21 NaN NaN
mean NaN 1.197834 NaN NaN NaN 120.653739 19.990320
std NaN 0.705124 NaN NaN NaN 183.633928 15.806405
min NaN 1.000000 NaN NaN NaN 0.850000 0.000000
25% NaN 1.000000 NaN NaN NaN 39.900000 13.080000
50% NaN 1.000000 NaN NaN NaN 74.990000 16.260000
75% NaN 1.000000 NaN NaN NaN 134.900000 21.150000
max NaN 21.000000 NaN NaN NaN 6735.000000 409.680000
Code
df_order_payments.describe(include = "all")
order_id payment_sequential payment_type payment_installments payment_value
count 103886 103886.000000 103886 103886.000000 103886.000000
unique 99440 NaN 5 NaN NaN
top fa65dad1b0e818e3ccc5cb0e39231352 NaN credit_card NaN NaN
freq 29 NaN 76795 NaN NaN
mean NaN 1.092679 NaN 2.853349 154.100380
std NaN 0.706584 NaN 2.687051 217.494064
min NaN 1.000000 NaN 0.000000 0.000000
25% NaN 1.000000 NaN 1.000000 56.790000
50% NaN 1.000000 NaN 1.000000 100.000000
75% NaN 1.000000 NaN 4.000000 171.837500
max NaN 29.000000 NaN 24.000000 13664.080000
Code
df_order_reviews.describe(include = "all")
review_id order_id review_score review_comment_title review_comment_message review_creation_date review_answer_timestamp
count 99224 99224 99224.000000 11568 40977 99224 99224
unique 98410 98673 NaN 4527 36159 636 98248
top 7b606b0d57b078384f0b58eac1d41d78 c88b1d1b157a9999ce368f218a407141 NaN Recomendo Muito bom 2017-12-19 00:00:00 2017-06-15 23:21:05
freq 3 3 NaN 423 230 463 4
mean NaN NaN 4.086421 NaN NaN NaN NaN
std NaN NaN 1.347579 NaN NaN NaN NaN
min NaN NaN 1.000000 NaN NaN NaN NaN
25% NaN NaN 4.000000 NaN NaN NaN NaN
50% NaN NaN 5.000000 NaN NaN NaN NaN
75% NaN NaN 5.000000 NaN NaN NaN NaN
max NaN NaN 5.000000 NaN NaN NaN NaN
Code
df_orders.describe(include = "all")
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
count 99441 99441 99441 99441 99281 97658 96476 99441
unique 99441 99441 8 98875 90733 81018 95664 459
top e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2018-04-11 10:48:14 2018-02-27 04:31:10 2018-05-09 15:48:00 2018-05-08 23:38:46 2017-12-20 00:00:00
freq 1 1 96478 3 9 47 3 522
Code
df_products.describe(include = "all")
product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
count 32951 32341 32341.000000 32341.000000 32341.000000 32949.000000 32949.000000 32949.000000 32949.000000
unique 32951 73 NaN NaN NaN NaN NaN NaN NaN
top 1e9e8ef04dbcff4541ed26657ea517e5 cama_mesa_banho NaN NaN NaN NaN NaN NaN NaN
freq 1 3029 NaN NaN NaN NaN NaN NaN NaN
mean NaN NaN 48.476949 771.495285 2.188986 2276.472488 30.815078 16.937661 23.196728
std NaN NaN 10.245741 635.115225 1.736766 4282.038731 16.914458 13.637554 12.079047
min NaN NaN 5.000000 4.000000 1.000000 0.000000 7.000000 2.000000 6.000000
25% NaN NaN 42.000000 339.000000 1.000000 300.000000 18.000000 8.000000 15.000000
50% NaN NaN 51.000000 595.000000 1.000000 700.000000 25.000000 13.000000 20.000000
75% NaN NaN 57.000000 972.000000 3.000000 1900.000000 38.000000 21.000000 30.000000
max NaN NaN 76.000000 3992.000000 20.000000 40425.000000 105.000000 105.000000 118.000000
Code
df_sellers.describe(include = "all")
seller_id seller_zip_code_prefix seller_city seller_state
count 3095 3095.000000 3095 3095
unique 3095 NaN 611 23
top 3442f8959a84dea7ee197c632cb2df15 NaN sao paulo SP
freq 1 NaN 694 1849
mean NaN 32291.059451 NaN NaN
std NaN 32713.453830 NaN NaN
min NaN 1001.000000 NaN NaN
25% NaN 7093.500000 NaN NaN
50% NaN 14940.000000 NaN NaN
75% NaN 64552.500000 NaN NaN
max NaN 99730.000000 NaN NaN
Code
df_product_category_name.describe(include = "all")
product_category_name product_category_name_english
count 71 71
unique 71 71
top beleza_saude health_beauty
freq 1 1
Import SQLAlchemy and load the JupySQL extension to run some exploratory SQL queries against the SQLite database
Code
from sqlalchemy import create_engine

# 1. Load the extension first
%reload_ext sql

# 2. Define your path
db_path = r'C:\Users\desja\Dropbox\Backup of Jumper Laptop files continuous\Documents\Project Revamp Portfolio\Brazilian Ecommerce Dataset\olist.sqlite'

# 3. Create the engine (safest way to handle paths with spaces)
engine = create_engine(f"sqlite:///{db_path}")

# 4. Connect JupySQL to that engine
%sql engine

# 5. List all tables in the database to see their exact names
%sql SELECT name FROM sqlite_master WHERE type='table';
name
product_category_name_translation
sellers
customers
geolocation
order_items
order_payments
order_reviews
orders
products
leads_qualified
Truncated to displaylimit of 10.
It turns out that order IDs in the orders table are all unique, which is good. Order IDs in the order_items table are sometimes duplicated because the customer ordered more than one item in the same order, which also makes sense. See below:
Code
%%sql
SELECT *
FROM order_items
WHERE order_id IN (
    -- Subquery: Find only the order_ids that appear more than once
    SELECT order_id
    FROM order_items
    GROUP BY order_id
    HAVING COUNT(order_id) > 1
)
ORDER BY order_id;
order_id order_item_id product_id seller_id shipping_limit_date price freight_value
0008288aa423d2a3f00fcb17cd7d8719 1 368c6c730842d78016ad823897a372db 1f50f920176fa81dab994f9023523100 2018-02-21 02:55:52 49.9 13.37
0008288aa423d2a3f00fcb17cd7d8719 2 368c6c730842d78016ad823897a372db 1f50f920176fa81dab994f9023523100 2018-02-21 02:55:52 49.9 13.37
00143d0f86d6fbd9f9b38ab440ac16f5 1 e95ee6822b66ac6058e2e4aff656071a a17f621c590ea0fab3d5d883e1630ec6 2017-10-20 16:07:52 21.33 15.1
00143d0f86d6fbd9f9b38ab440ac16f5 2 e95ee6822b66ac6058e2e4aff656071a a17f621c590ea0fab3d5d883e1630ec6 2017-10-20 16:07:52 21.33 15.1
00143d0f86d6fbd9f9b38ab440ac16f5 3 e95ee6822b66ac6058e2e4aff656071a a17f621c590ea0fab3d5d883e1630ec6 2017-10-20 16:07:52 21.33 15.1
001ab0a7578dd66cd4b0a71f5b6e1e41 1 0b0172eb0fd18479d29c3bc122c058c2 5656537e588803a555b8eb41f07a944b 2018-01-04 02:33:42 24.89 17.63
001ab0a7578dd66cd4b0a71f5b6e1e41 2 0b0172eb0fd18479d29c3bc122c058c2 5656537e588803a555b8eb41f07a944b 2018-01-04 02:33:42 24.89 17.63
001ab0a7578dd66cd4b0a71f5b6e1e41 3 0b0172eb0fd18479d29c3bc122c058c2 5656537e588803a555b8eb41f07a944b 2018-01-04 02:33:42 24.89 17.63
001d8f0e34a38c37f7dba2a37d4eba8b 1 e67307ff0f15ade43fcb6e670be7a74c f4aba7c0bca51484c30ab7bdc34bcdd1 2017-05-18 17:35:11 18.99 7.78
001d8f0e34a38c37f7dba2a37d4eba8b 2 e67307ff0f15ade43fcb6e670be7a74c f4aba7c0bca51484c30ab7bdc34bcdd1 2017-05-18 17:35:11 18.99 7.78
Truncated to displaylimit of 10.
Below it is evident that the customer_unique_id helps identify repeat customers, which is great for marketing analysis.
Code
%%sql
SELECT *
FROM customers
WHERE customer_unique_id IN (
    SELECT customer_unique_id
    FROM customers
    GROUP BY customer_unique_id
    HAVING COUNT(*) > 1
)
ORDER BY customer_unique_id;
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
1afe8a9c67eec3516c09a8bdcc539090 00172711b30d52eea8b313a7f2cced02 45200 jequie BA
24b0e2bd287e47d54d193e7bbb51103f 00172711b30d52eea8b313a7f2cced02 45200 jequie BA
1b4a75b3478138e99902678254b260f4 004288347e5e88a27ded2bb23747066c 26220 nova iguacu RJ
f6efe5d5c7b85e12355f9d5c3db46da2 004288347e5e88a27ded2bb23747066c 26220 nova iguacu RJ
49cf243e0d353cd418ca77868e24a670 004b45ec5c64187465168251cd1c9c2f 57055 maceio AL
d95f60d70d9ea9a7fe37c53c931940bb 004b45ec5c64187465168251cd1c9c2f 57035 maceio AL
8ac44e9c15d396b8c3c7cbab0fff4536 0058f300f57d7b93c477a131a59b36c3 41370 salvador BA
f530197ea86ced9488a03d055e118ebf 0058f300f57d7b93c477a131a59b36c3 40731 salvador BA
876356df457f952458a764348e1858bc 00a39521eb40f7012db50455bf083460 72595 brasilia DF
cbb68c721ba9ddb30d8a490cc1897fa1 00a39521eb40f7012db50455bf083460 72595 brasilia DF
Truncated to displaylimit of 10.

Save updated CSV files to local disk
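
The export code itself is not shown here; below is a minimal sketch of how the cleaned dataframes could be written back out. The output_path and the "_cleaned" suffix are placeholders, not the actual project paths.
Code
#Sketch: write each cleaned dataframe back to disk (output_path is a placeholder)
output_path = "C:/path/to/output"  #placeholder directory
for name, data_frame in df_dict.items():
    data_frame.to_csv(f"{output_path}/{name}_cleaned.csv", index=False)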

Data Analysis

Let’s find the top-selling product categories by item count
Code
%%sql 
SELECT COUNT(*) AS ordercount, p.product_category_name, t.product_category_name_english
FROM order_items o
LEFT JOIN products p ON p.product_id = o.product_id
LEFT JOIN product_category_name_translation t ON t.product_category_name = p.product_category_name
GROUP BY p.product_category_name
ORDER BY ordercount DESC
LIMIT 10
ordercount product_category_name product_category_name_english
11115 cama_mesa_banho bed_bath_table
9670 beleza_saude health_beauty
8641 esporte_lazer sports_leisure
8334 moveis_decoracao furniture_decor
7827 informatica_acessorios computers_accessories
6964 utilidades_domesticas housewares
5991 relogios_presentes watches_gifts
4545 telefonia telephony
4347 ferramentas_jardim garden_tools
4235 automotivo auto
Truncated to displaylimit of 10.
Next, let’s rank the product categories by total sales
Code
%%sql 
SELECT SUM(o.price) AS TotalSales, p.product_category_name, t.product_category_name_english
FROM order_items o
LEFT JOIN products p ON p.product_id = o.product_id
LEFT JOIN product_category_name_translation t ON t.product_category_name = p.product_category_name
GROUP BY p.product_category_name
ORDER BY TotalSales DESC
LIMIT 10
TotalSales product_category_name product_category_name_english
1258681.34 beleza_saude health_beauty
1205005.68 relogios_presentes watches_gifts
1036988.68 cama_mesa_banho bed_bath_table
988048.97 esporte_lazer sports_leisure
911954.32 informatica_acessorios computers_accessories
729762.49 moveis_decoracao furniture_decor
635290.85 cool_stuff cool_stuff
632248.66 utilidades_domesticas housewares
592720.11 automotivo auto
485256.46 ferramentas_jardim garden_tools
Truncated to displaylimit of 10.
As the list shows, health_beauty and watches_gifts edge out bed_bath_table in total sales even though bed_bath_table is the clear winner by item count. Let’s see which categories are the most expensive on average, to see how that relates to the list of top categories by sales.
Code
%%sql 
SELECT ROUND(AVG(o.price),2) AS average_price, p.product_category_name, t.product_category_name_english
FROM order_items o
LEFT JOIN products p ON p.product_id = o.product_id
LEFT JOIN product_category_name_translation t ON t.product_category_name = p.product_category_name
GROUP BY p.product_category_name
ORDER BY average_price DESC
LIMIT 10
average_price product_category_name product_category_name_english
1098.34 pcs computers
624.29 portateis_casa_forno_e_cafe small_appliances_home_oven_and_coffee
476.12 eletrodomesticos_2 home_appliances_2
342.12 agro_industria_e_comercio agro_industry_and_commerce
281.62 instrumentos_musicais musical_instruments
280.78 eletroportateis small_appliances
264.57 portateis_cozinha_e_preparadores_de_alimentos None
225.69 telefonia_fixa fixed_telephony
208.99 construcao_ferramentas_seguranca construction_tools_safety
201.14 relogios_presentes watches_gifts
Truncated to displaylimit of 10.
It appears that Olist sells many categories whose items are, on average, quite expensive compared with the best sellers by item count and total sales. Let’s take a visual look at monthly sales (aggregated across years) for the store as a whole to see whether there is a particular time of year when most sales are made. This could inform marketing decisions. Since marketing can be expensive (TV commercials, radio spots, and newspaper space are not trivial costs), strategic placement matters for effectiveness. By combining the best-selling categories with the timeframes when people purchase from the online store most frequently, Olist could maximize the return on its advertising.
Code
%matplotlib inline
Code
df_orders["month_name"] = df_orders["order_purchase_timestamp"].dt.month_name()
Code
# 1. MERGE TABLES
# We join them on 'order_id' so every item gets a timestamp
df_merged = pd.merge(df_order_items, df_orders, on='order_id')

# 2. CONVERT TO DATETIME
# Ensure the timestamp is an actual date object, not just text
df_merged['order_purchase_timestamp'] = pd.to_datetime(df_merged['order_purchase_timestamp'])

# 3. EXTRACT MONTH DATA
# We create a Number column for sorting (1=Jan) and a Name column for the chart labels
df_merged['month_number'] = df_merged['order_purchase_timestamp'].dt.month
df_merged['month_name'] = df_merged['order_purchase_timestamp'].dt.month_name()

# 4. GROUP AND AGGREGATE
# We sum the prices for all items falling in "January", "February", etc.
monthly_sales = df_merged.groupby(['month_number', 'month_name'])['price'].sum().reset_index()

# 5. SORT
# Important: Sort by 'month_number' so the chart reads Jan -> Dec
monthly_sales = monthly_sales.sort_values('month_number')

# 6. PLOT THE BAR CHART
plt.figure(figsize=(12, 6))
plt.bar(monthly_sales['month_name'], monthly_sales['price'], color='skyblue')

plt.title('Total Sales by Month (All Years Aggregated)')
plt.xlabel('Month')
plt.ylabel('Total Sales ($)')
plt.xticks(rotation=45) # Tilt the labels so they fit better

plt.show()

Contrary to what I expected, there is no spike in sales around December. Since I am from the United States, where November and December are special months because of Thanksgiving and Christmas, that time of year may simply be less of a sales season in Brazil.
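One caveat worth checking before acting on this chart: if the dataset does not cover every month of every year equally, aggregating across years can distort the monthly pattern. A minimal sketch to inspect sales by year and month separately:
Code
#Sketch: break sales out by year and month to check whether the aggregate is skewed by uneven coverage
df_merged['year'] = df_merged['order_purchase_timestamp'].dt.year
yearly_monthly = df_merged.groupby(['year', 'month_number'])['price'].sum().unstack('year')
print(yearly_monthly)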
Code
# ---------------------------------------------------------
# STEP 0: RESET THE DATAFRAME
# ---------------------------------------------------------
# We re-create df_merged from scratch so the columns don't get messed up by re-running
df_merged = pd.merge(df_order_items, df_orders, on='order_id')
df_merged['order_purchase_timestamp'] = pd.to_datetime(df_merged['order_purchase_timestamp'])
df_merged['month_name'] = df_merged['order_purchase_timestamp'].dt.month_name()

# ---------------------------------------------------------
# STEP 1: Add Product Names
# ---------------------------------------------------------
# Add Product info (to get the category name in Portuguese)
df_merged = pd.merge(df_merged, df_products, on='product_id')

# Add English Translation
df_merged = pd.merge(df_merged, df_product_category_name, on='product_category_name')

# ---------------------------------------------------------
# STEP 2: Filter & Prepare
# ---------------------------------------------------------
# Filter for Top 10 categories
top_categories = df_merged['product_category_name_english'].value_counts().head(10).index
df_top = df_merged[df_merged['product_category_name_english'].isin(top_categories)]

# Create Contingency Table
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']

contingency_table = pd.crosstab(df_top['month_name'], df_top['product_category_name_english'])
contingency_table = contingency_table.reindex(month_order)

# ---------------------------------------------------------
# STEP 3: Statistical Test & Plot
# ---------------------------------------------------------
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-Square P-Value: {p}")
if p < 0.05:
    print("CONCLUSION: Statistically significant seasonal patterns detected!")
else:
    print("CONCLUSION: No significant patterns found.")

plt.figure(figsize=(12, 8))

# Normalize by column (axis=0) to see peak months
heatmap_data = contingency_table.div(contingency_table.max(axis=0), axis=1)

# Plotting
sb.heatmap(heatmap_data, cmap="YlGnBu", annot=False, linewidths=.5)
plt.title('Seasonality Heatmap: When does each category sell the most?')
plt.xlabel('Product Category')
plt.ylabel('Month')
plt.yticks(rotation=0) 

plt.show()
Chi-Square P-Value: 9.182505419196234e-297
CONCLUSION: Statistically significant seasonal patterns detected!

The deeper statistical analysis confirms a statistically significant seasonal pattern, with March through August (inclusive) being the months with the greatest sales. Next I will run a similar analysis, but with the top 10 categories selected by aggregated sales figures rather than by item counts.
Code
# --- Data Preparation ---
# Re-merge base dataframes to ensure a clean starting point (avoids duplicate columns from multiple runs)
df_merged = pd.merge(df_order_items, df_orders, on='order_id')
df_merged['order_purchase_timestamp'] = pd.to_datetime(df_merged['order_purchase_timestamp'])
df_merged['month_name'] = df_merged['order_purchase_timestamp'].dt.month_name()

# Merge product metadata and English translations
df_merged = pd.merge(df_merged, df_products, on='product_id')
df_merged = pd.merge(df_merged, df_product_category_name, on='product_category_name')

# --- Feature Engineering: Top Categories by Revenue ---
# Calculate total revenue per category to identify the top 10 high-value segments
category_revenue = df_merged.groupby('product_category_name_english')['price'].sum()
top_revenue_categories = category_revenue.sort_values(ascending=False).head(10).index

# Filter the main dataset to include only transactions for these top 10 categories
df_top_revenue = df_merged[df_merged['product_category_name_english'].isin(top_revenue_categories)]

# --- Statistical Analysis: Chi-Square Test ---
# Define chronological month order for proper sorting
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']

# Create a contingency table (Month vs. Category)
contingency_table = pd.crosstab(df_top_revenue['month_name'], df_top_revenue['product_category_name_english'])
contingency_table = contingency_table.reindex(month_order)

# Perform Chi-Square Test of Independence to detect seasonality
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-Square Test P-Value: {p}")

# --- Visualization: Seasonality Heatmap ---
plt.figure(figsize=(12, 8))

# Normalize by column (axis=0) to highlight peak sales months relative to each category
heatmap_data = contingency_table.div(contingency_table.max(axis=0), axis=1)

sb.heatmap(heatmap_data, cmap="YlGnBu", annot=False, linewidths=.5)
plt.title('Seasonality Heatmap: Peak Revenue Months (Top 10 Categories)')
plt.xlabel('Product Category')
plt.ylabel('Month')
plt.yticks(rotation=0)

plt.show()
Chi-Square Test P-Value: 5.212916281064613e-302

This confirms that the patterns for sales and item counts are very similar, though not identical. The first heat map (item counts) could inform inventory decisions for the various product categories, while the second (based on the top categories by revenue) could inform how advertising spend is allocated across categories. For example, most items should be ordered and stocked in September and October (when sales and order counts are lowest), and advertising should be heaviest in March and August, corresponding to natural purchasing patterns (except for garden tools, which sell best in November in Brazil).
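Note that the contingency table above still counts line items within the revenue-ranked categories. For a strictly revenue-weighted view of the heatmap (not the chi-square test, which expects counts), the crosstab could aggregate price directly — a minimal sketch:
Code
#Sketch: revenue-weighted month x category table for the heatmap
revenue_table = pd.crosstab(df_top_revenue['month_name'], df_top_revenue['product_category_name_english'],
                            values=df_top_revenue['price'], aggfunc='sum').reindex(month_order)
revenue_heatmap = revenue_table.div(revenue_table.max(axis=0), axis=1)

plt.figure(figsize=(12, 8))
sb.heatmap(revenue_heatmap, cmap="YlGnBu", linewidths=.5)
plt.title('Seasonality Heatmap: Revenue Share by Month (Top 10 Categories)')
plt.show()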
Code
# --- STEP 1: PREPARE DATA ---
# We need to merge with the 'customers' table to get the 'customer_unique_id'
# (df_merged currently has 'customer_id', which is different for every order)
# Merge unique IDs into main dataframe
df_segmentation = pd.merge(df_merged, df_customers, on='customer_id')
Code
# --- STEP 2: FEATURE ENGINEERING (RFM) ---
# We calculate Recency, Frequency, and Monetary value for each unique customer
max_date = df_segmentation['order_purchase_timestamp'].max()

rfm_data = df_segmentation.groupby('customer_unique_id').agg({
    'order_purchase_timestamp': lambda x: (max_date - x.max()).days, # Recency (days since last order)
    'order_id': 'count',                                             # Frequency (total items bought)
    'price': 'sum'                                                   # Monetary (total spent)
}).reset_index()

# Rename columns for clarity
rfm_data.columns = ['customer_unique_id', 'Recency', 'Frequency', 'Monetary']
Code
# --- STEP 3: MACHINE LEARNING (K-MEANS) ---
# 1. Preprocessing: We use StandardScaler to make Recency (0-300) comparable to Frequency (1-10).
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_data[['Recency', 'Frequency', 'Monetary']])

# 2. Apply K-Means Clustering
# We will ask the model to find 4 distinct groups of customers
kmeans = KMeans(n_clusters=4, random_state=42)
rfm_data['Cluster'] = kmeans.fit_predict(rfm_scaled)
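The choice of 4 clusters is an assumption; a quick elbow check (a minimal sketch) can be used to sanity-check that number before committing to it:
Code
#Sketch: elbow method to sanity-check the number of clusters
inertias = []
k_values = range(1, 9)
for k in k_values:
    inertias.append(KMeans(n_clusters=k, random_state=42, n_init=10).fit(rfm_scaled).inertia_)

plt.figure(figsize=(8, 4))
plt.plot(list(k_values), inertias, marker='o')
plt.title('Elbow Plot: K-Means Inertia vs. Number of Clusters')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()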
Code
# --- STEP 4: VISUALIZATION ---
# Plot 'Recency' vs 'Monetary' to see the segments
plt.figure(figsize=(12, 8))

sb.scatterplot(
    data=rfm_data, 
    x='Recency', 
    y='Monetary', 
    hue='Cluster',      # Color = Segment
    size='Frequency',   # Size = How often they buy (The missing metric!)
    sizes=(20, 400),    # Control the range of dot sizes (min, max)
    palette='viridis',
    alpha=0.6
)

plt.title('Customer Segments: Recency vs. Monetary (Size = Frequency)')
plt.xlabel('Recency (Days since last purchase)')
plt.ylabel('Monetary (Total Spend)')
plt.ylim(0, 3000) # Limit Y-axis to keep the chart readable
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()

plt.show()

Code
# --- STEP 5: INTERPRETATION ---
# Show the average stats for each cluster
print("Cluster Profiles (Who are these people?):")
print(rfm_data.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean())
Cluster Profiles (Who are these people?):
            Recency  Frequency    Monetary
Cluster                                   
0         90.465564   1.092653  117.935052
1        462.831128   1.093289  118.961608
2        231.826464   2.881345  640.670720
3        258.580630   1.091414  114.351480
From this plot we can see that the vast majority of customers spend below about 800 in total and purchase only once. We can also see a group of customers who have ordered very recently. The plot shows the highest spenders (top of the Y-axis), the new potential customers who have made purchases recently, a middle group that may need attention to draw them back to the online store, and a final group (far right on the X-axis) that seems uninterested in shopping regularly on Olist. The customers who make large purchases also tend to be the more frequent customers.
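A quick correlation check (a minimal sketch) can quantify the impression that the big spenders also tend to be the more frequent buyers:
Code
#Sketch: correlations between the RFM features
print(rfm_data[['Recency', 'Frequency', 'Monetary']].corr())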
Code
%%sql
SELECT 
    c.customer_city,
    c.customer_state,
    COUNT(*) as num_vip_customers
FROM customers c
JOIN (
    SELECT 
        c2.customer_unique_id
    FROM orders o
    JOIN customers c2 ON o.customer_id = c2.customer_id
    GROUP BY c2.customer_unique_id
    -- We use DISTINCT to ensure we are counting 'Visits', not 'Items'
    HAVING COUNT(DISTINCT o.order_id) >= 5
) as vip_shoppers ON c.customer_unique_id = vip_shoppers.customer_unique_id
GROUP BY c.customer_city, c.customer_state
ORDER BY num_vip_customers DESC
LIMIT 5;
customer_city customer_state num_vip_customers
sao paulo SP 32
praia grande SP 9
ituiutaba MG 7
recife PE 7
santos SP 7

The Impact of Delivery Time on Reviews

Code
# --- STEP 1: PREPARE THE DATA ---
df_satisfaction = pd.merge(df_orders, df_order_reviews, on='order_id')

# Convert dates to datetime objects (if not already)
df_satisfaction['order_purchase_timestamp'] = pd.to_datetime(df_satisfaction['order_purchase_timestamp'])
df_satisfaction['order_delivered_customer_date'] = pd.to_datetime(df_satisfaction['order_delivered_customer_date'])
Code
# --- STEP 2: NUMPY FEATURE ENGINEERING ---

# We use Numpy to calculate the 'Actual Delivery Time' in days.
# We divide by np.timedelta64(1, 'D') to convert the time difference into a simple number (float).

df_satisfaction['delivery_days'] = (
    df_satisfaction['order_delivered_customer_date'] - df_satisfaction['order_purchase_timestamp']
) / np.timedelta64(1, 'D')

# Drop rows where delivery hasn't happened yet (NaNs) or outliers (> 60 days) for a cleaner chart
df_clean = df_satisfaction.dropna(subset=['delivery_days', 'review_score'])
df_clean = df_clean[df_clean['delivery_days'] < 60]
Code
# --- STEP 3: STATISTICAL INFERENCE (Simple Linear Regression) ---
# Question: Can Delivery Days predict Review Score?
# Formula: Review Score = Intercept + (Coefficient * Delivery Days)

# Define X (Predictor) and y (Target)
X = df_clean['delivery_days']
y = df_clean['review_score']

# Statsmodels requires us to manually add a constant (Intercept) to the equation
X = sm.add_constant(X)

# Fit the Ordinary Least Squares (OLS) model
model = sm.OLS(y, X).fit()

# Print the professional summary table
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           review_score   R-squared:                       0.117
Model:                            OLS   Adj. R-squared:                  0.117
Method:                 Least Squares   F-statistic:                 1.277e+04
Date:                Wed, 03 Dec 2025   Prob (F-statistic):               0.00
Time:                        15:50:49   Log-Likelihood:            -1.5398e+05
No. Observations:               96067   AIC:                         3.080e+05
Df Residuals:                   96065   BIC:                         3.080e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             4.8083      0.007    695.616      0.000       4.795       4.822
delivery_days    -0.0526      0.000   -113.007      0.000      -0.053      -0.052
==============================================================================
Omnibus:                    20048.602   Durbin-Watson:                   2.007
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            35689.268
Skew:                          -1.361   Prob(JB):                         0.00
Kurtosis:                       4.228   Cond. No.                         26.6
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Code
# --- STEP 4: VISUALIZATION ---
plt.figure(figsize=(10, 6))
sb.regplot(
    x='delivery_days', 
    y='review_score', 
    data=df_clean, 
    scatter_kws={'alpha': 0.05, 'color': 'gray'},
    line_kws={'color': 'red'}
)
plt.title('Impact of Delivery Time on Customer Review Scores')
plt.xlabel('Days to Deliver')
plt.ylabel('Review Score (1-5)')
plt.show()

The obvious trend is that as delivery time increases, the rating drops, but the more important detail is the slope of the line: roughly half a rating point is lost for every 10 additional days of delivery (a coefficient of about -0.053 per day). This slope is statistically significant, with a p-value far below 0.05.
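To translate the slope into concrete expectations, the fitted model can be evaluated at a few example delivery times — a minimal sketch using the model object fit above:
Code
#Sketch: predicted average review score at a few example delivery times
example_days = pd.DataFrame({'delivery_days': [5, 10, 20, 30]})
example_days = sm.add_constant(example_days)
print(model.predict(example_days))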
In conclusion, the best months to stock merchandise are September and October, and the best months to advertise products are March and August. The best city and state to advertise in is São Paulo, São Paulo, and to protect review scores, delivery within 10 days or less should be the target.