Brazilian Ecommerce Dataset

Importing the data into Python variables

Import applicable libraries

Code
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Read the CSV into Pandas dataframes

Code
base_path = "C:/Users/desja/Dropbox/Backup of Jumper Laptop files continuous/Documents/Project Revamp Portfolio/Brazilian Ecommerce Dataset (extra files)"

df_customers = pd.read_csv(f"{base_path}/olist_customers_dataset.csv")
df_geolocation = pd.read_csv(f"{base_path}/olist_geolocation_dataset.csv")
df_order_items = pd.read_csv(f"{base_path}/olist_order_items_dataset.csv")
df_order_payments = pd.read_csv(f"{base_path}/olist_order_payments_dataset.csv")
df_order_reviews = pd.read_csv(f"{base_path}/olist_order_reviews_dataset.csv")
df_orders = pd.read_csv(f"{base_path}/olist_orders_dataset.csv")
df_products = pd.read_csv(f"{base_path}/olist_products_dataset.csv")
df_sellers = pd.read_csv(f"{base_path}/olist_sellers_dataset.csv")
df_product_category_name = pd.read_csv(f"{base_path}/product_category_name_translation.csv")

View the first few rows of the tables to get an idea of what is in them

Code
df_customers.head()
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
0 06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP
1 18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 9790 sao bernardo do campo SP
2 4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 1151 sao paulo SP
3 b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 8775 mogi das cruzes SP
4 4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP
Code
df_geolocation.head()
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
0 1037 -23.545621 -46.639292 sao paulo SP
1 1046 -23.546081 -46.644820 sao paulo SP
2 1046 -23.546129 -46.642951 sao paulo SP
3 1041 -23.544392 -46.639499 sao paulo SP
4 1035 -23.541578 -46.641607 sao paulo SP
Code
df_order_items.head()
order_id order_item_id product_id seller_id shipping_limit_date price freight_value
0 00010242fe8c5a6d1ba2dd792cb16214 1 4244733e06e7ecb4970a6e2683c13e61 48436dade18ac8b2bce089ec2a041202 2017-09-19 09:45:35 58.90 13.29
1 00018f77f2f0320c557190d7a144bdd3 1 e5f2d52b802189ee658865ca93d83a8f dd7ddc04e1b6c2c614352b383efe2d36 2017-05-03 11:05:13 239.90 19.93
2 000229ec398224ef6ca0657da4fc703e 1 c777355d18b72b67abbeef9df44fd0fd 5b51032eddd242adc84c38acab88f23d 2018-01-18 14:48:30 199.00 17.87
3 00024acbcdf0a6daa1e931b038114c75 1 7634da152a4610f1595efa32f14722fc 9d7a1d34a5052409006425275ba1c2b4 2018-08-15 10:10:18 12.99 12.79
4 00042b26cf59d7ce69dfabb4e55b4fd9 1 ac6c3623068f30de03045865e4e10089 df560393f3a51e74553ab94004ba5c87 2017-02-13 13:57:51 199.90 18.14
Code
df_order_payments.head()
order_id payment_sequential payment_type payment_installments payment_value
0 b81ef226f3fe1789b1e8b2acac839d17 1 credit_card 8 99.33
1 a9810da82917af2d9aefd1278f1dcfa0 1 credit_card 1 24.39
2 25e8ea4e93396b6fa0d3dd708e76c1bd 1 credit_card 1 65.71
3 ba78997921bbcdc1373bb41e913ab953 1 credit_card 8 107.78
4 42fdf880ba16b47b59251dd489d4441a 1 credit_card 2 128.45
Code
df_order_reviews.head()
review_id order_id review_score review_comment_title review_comment_message review_creation_date review_answer_timestamp
0 7bc2406110b926393aa56f80a40eba40 73fc7af87114b39712e6da79b0a377eb 4 NaN NaN 2018-01-18 00:00:00 2018-01-18 21:46:59
1 80e641a11e56f04c1ad469d5645fdfde a548910a1c6147796b98fdf73dbeba33 5 NaN NaN 2018-03-10 00:00:00 2018-03-11 03:05:13
2 228ce5500dc1d8e020d8d1322874b6f0 f9e4b658b201a9f2ecdecbb34bed034b 5 NaN NaN 2018-02-17 00:00:00 2018-02-18 14:36:24
3 e64fb393e7b32834bb789ff8bb30750e 658677c97b385a9be170737859d3511b 5 NaN Recebi bem antes do prazo estipulado. 2017-04-21 00:00:00 2017-04-21 22:02:06
4 f7c4243c7fe1938f181bec41a392bdeb 8e6bfb81e283fa7e4f11123a3fb894f1 5 NaN Parabéns lojas lannister adorei comprar pela I... 2018-03-01 00:00:00 2018-03-02 10:26:53
Code
df_orders.head()
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
0 e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2017-10-02 10:56:33 2017-10-02 11:07:15 2017-10-04 19:55:00 2017-10-10 21:25:13 2017-10-18 00:00:00
1 53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef delivered 2018-07-24 20:41:37 2018-07-26 03:24:27 2018-07-26 14:31:00 2018-08-07 15:27:45 2018-08-13 00:00:00
2 47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23 2018-08-08 13:50:00 2018-08-17 18:06:29 2018-09-04 00:00:00
3 949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59 2017-11-22 13:39:59 2017-12-02 00:28:42 2017-12-15 00:00:00
4 ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00
Code
df_products.head()
product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
0 1e9e8ef04dbcff4541ed26657ea517e5 perfumaria 40.0 287.0 1.0 225.0 16.0 10.0 14.0
1 3aa071139cb16b67ca9e5dea641aaa2f artes 44.0 276.0 1.0 1000.0 30.0 18.0 20.0
2 96bd76ec8810374ed1b65e291975717f esporte_lazer 46.0 250.0 1.0 154.0 18.0 9.0 15.0
3 cef67bcfe19066a932b7673e239eb23d bebes 27.0 261.0 1.0 371.0 26.0 4.0 26.0
4 9dc1a7de274444849c219cff195d0b71 utilidades_domesticas 37.0 402.0 4.0 625.0 20.0 17.0 13.0
Code
df_sellers.head()
seller_id seller_zip_code_prefix seller_city seller_state
0 3442f8959a84dea7ee197c632cb2df15 13023 campinas SP
1 d1b65fc7debc3361ea86b5f14c68d2e2 13844 mogi guacu SP
2 ce3ad9de960102d0677a81f5d0bb7b2d 20031 rio de janeiro RJ
3 c0f3eea2e14555b6faeea3dd58c1b1c3 4195 sao paulo SP
4 51a04a8a6bdcb23deccc82b0b80742cf 12914 braganca paulista SP
Code
df_product_category_name.head()
product_category_name product_category_name_english
0 beleza_saude health_beauty
1 informatica_acessorios computers_accessories
2 automotivo auto
3 cama_mesa_banho bed_bath_table
4 moveis_decoracao furniture_decor

Iterate through each dataframe and each of its columns to check for empty cells (NaN) or question-mark placeholders

Code
#First store all the dataframes in a dictionary

df_dict = {'df_customers': df_customers, 'df_geolocation': df_geolocation, 'df_order_items': df_order_items,
           'df_order_payments': df_order_payments, 'df_order_reviews': df_order_reviews, 'df_orders': df_orders,
           'df_products': df_products, 'df_sellers': df_sellers, 'df_product_category_name': df_product_category_name}


#Iterate through each dataframe in the dictionary

for name, data_frame in df_dict.items():

    #Create boolean masks for empty cells (NaN) and for cells containing a question mark
    empty_cells = data_frame.isnull()
    has_question_mark = (data_frame == '?')

    #For each column, print the counts of True values whenever any are present
    for column in empty_cells.columns:
        if empty_cells[column].any():
            print(f"Table: {name} | Column: {column}")
            print("Empty Cells")
            print(empty_cells[column].value_counts())
            print("")

    for column in has_question_mark.columns:
        if has_question_mark[column].any():
            print(f"Table: {name} | Column: {column}")
            print("Has Question Mark")
            print(has_question_mark[column].value_counts())
            print("")

Table: df_order_reviews | Column: review_comment_title
Empty Cells
review_comment_title
True     87656
False    11568
Name: count, dtype: int64

Table: df_order_reviews | Column: review_comment_message
Empty Cells
review_comment_message
True     58247
False    40977
Name: count, dtype: int64

Table: df_order_reviews | Column: review_comment_title
Has Question Mark
review_comment_title
False    99223
True         1
Name: count, dtype: int64

Table: df_order_reviews | Column: review_comment_message
Has Question Mark
review_comment_message
False    99221
True         3
Name: count, dtype: int64

Table: df_orders | Column: order_approved_at
Empty Cells
order_approved_at
False    99281
True       160
Name: count, dtype: int64

Table: df_orders | Column: order_delivered_carrier_date
Empty Cells
order_delivered_carrier_date
False    97658
True      1783
Name: count, dtype: int64

Table: df_orders | Column: order_delivered_customer_date
Empty Cells
order_delivered_customer_date
False    96476
True      2965
Name: count, dtype: int64

Table: df_products | Column: product_category_name
Empty Cells
product_category_name
False    32341
True       610
Name: count, dtype: int64

Table: df_products | Column: product_name_lenght
Empty Cells
product_name_lenght
False    32341
True       610
Name: count, dtype: int64

Table: df_products | Column: product_description_lenght
Empty Cells
product_description_lenght
False    32341
True       610
Name: count, dtype: int64

Table: df_products | Column: product_photos_qty
Empty Cells
product_photos_qty
False    32341
True       610
Name: count, dtype: int64

Table: df_products | Column: product_weight_g
Empty Cells
product_weight_g
False    32949
True         2
Name: count, dtype: int64

Table: df_products | Column: product_length_cm
Empty Cells
product_length_cm
False    32949
True         2
Name: count, dtype: int64

Table: df_products | Column: product_height_cm
Empty Cells
product_height_cm
False    32949
True         2
Name: count, dtype: int64

Table: df_products | Column: product_width_cm
Empty Cells
product_width_cm
False    32949
True         2
Name: count, dtype: int64
Rename misspelled column names in the products table
Code
df_products.rename(columns={"product_name_lenght": "product_name_length",
                            "product_description_lenght": "product_description_length"}, inplace=True)

Analyze NULL values in table and replace them strategically where appropriate

It is reasonable for the review comment title and message to be left blank or to contain only a question mark.
After reviewing the orders and order_reviews tables, I determined that most of the orders with these empty review fields were simply never delivered. The reviewers often expressed frustration, gave the company a poor rating, and stated that they never received the product (consistent with the empty order_delivered_customer_date cells). These are not data entry errors, so I will leave the cells blank.
In cases where the order_approved_at column was empty, it appears the products were out of stock, and this was not always clearly communicated to the shopper; at least one reviewer was under the impression that their order had gone through and was upset that they never received it.
Some replacement is warranted for orders that were marked delivered but have no approval timestamp. Below, I fill these NULL values using the average approval time (in hours) computed earlier in SQLite:
Code
#Ensure the order_purchase_timestamp column is indeed a timestamp 
df_orders['order_purchase_timestamp'] = pd.to_datetime(df_orders['order_purchase_timestamp'])

#Average approval time in hours (purchase to approval), computed earlier in SQLite
avg_time_to_approve = 10.4190943019277

#Condition: delivered orders that are missing an approval timestamp
condition = (df_orders['order_status'] == 'delivered') & (df_orders['order_approved_at'].isnull())

#Update entries that meet this condition with an "average"
df_orders.loc[condition, 'order_approved_at'] = (df_orders.loc[condition, 'order_purchase_timestamp'] + pd.to_timedelta(avg_time_to_approve, unit='h'))
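As a cross-check, the same average approval time could be recomputed directly in pandas rather than relying on the hardcoded SQLite figure. This is only a minimal sketch (ideally run before the fill above, so the filled values do not feed back into the average); the hardcoded value is what was actually used.
Code
#Sketch: average purchase-to-approval time in hours, computed in pandas
approved = df_orders.dropna(subset=['order_approved_at']).copy()
approved['order_approved_at'] = pd.to_datetime(approved['order_approved_at'])
avg_hours = (approved['order_approved_at'] - approved['order_purchase_timestamp']).mean() / pd.Timedelta(hours=1)
print(f"Average approval time: {avg_hours:.2f} hours")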
In the case of product_category_name and its NULL values, these appear to be miscellaneous products that did not fit under any of the main categories already listed. The best approach here is to create a new category called “Miscellaneous” for them:
Code
df_products['product_category_name'] = df_products['product_category_name'].fillna('Miscellaneous')
This ensures that no product_category_name values remain NULL. Since the remaining NULL/empty cells in the table belong to these specific uncategorized products, which simply have no corresponding data recorded, those fields are best left as NULL. For example, product_photos_qty is sometimes NULL for these uncategorized/miscellaneous products, yet some reviews indicate that photos were shown online, so filling in 0 would be inaccurate.
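A quick sanity check (a minimal sketch) confirms the fill and shows how many products ended up in the new category:
Code
#Sketch: confirm there are no remaining NULL category names and count the new Miscellaneous category
print(df_products['product_category_name'].isnull().sum())
print((df_products['product_category_name'] == 'Miscellaneous').sum())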

Exploratory analysis

Print some basic statistical summary data for each dataframe
Code
df_customers.describe(include = "all")
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
count 99441 99441 99441.000000 99441 99441
unique 99441 96096 NaN 4119 27
top 06b8999e2fba1a1fbc88172c00ba8bc7 8d50f5eadf50201ccdcedfb9e2ac8455 NaN sao paulo SP
freq 1 17 NaN 15540 41746
mean NaN NaN 35137.474583 NaN NaN
std NaN NaN 29797.938996 NaN NaN
min NaN NaN 1003.000000 NaN NaN
25% NaN NaN 11347.000000 NaN NaN
50% NaN NaN 24416.000000 NaN NaN
75% NaN NaN 58900.000000 NaN NaN
max NaN NaN 99990.000000 NaN NaN
Code
df_geolocation.describe(include = "all")
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
count 1.000163e+06 1.000163e+06 1.000163e+06 1000163 1000163
unique NaN NaN NaN 8011 27
top NaN NaN NaN sao paulo SP
freq NaN NaN NaN 135800 404268
mean 3.657417e+04 -2.117615e+01 -4.639054e+01 NaN NaN
std 3.054934e+04 5.715866e+00 4.269748e+00 NaN NaN
min 1.001000e+03 -3.660537e+01 -1.014668e+02 NaN NaN
25% 1.107500e+04 -2.360355e+01 -4.857317e+01 NaN NaN
50% 2.653000e+04 -2.291938e+01 -4.663788e+01 NaN NaN
75% 6.350400e+04 -1.997962e+01 -4.376771e+01 NaN NaN
max 9.999000e+04 4.506593e+01 1.211054e+02 NaN NaN
Code

df_order_items.describe(include = "all")
order_id order_item_id product_id seller_id shipping_limit_date price freight_value
count 112650 112650.000000 112650 112650 112650 112650.000000 112650.000000
unique 98666 NaN 32951 3095 93318 NaN NaN
top 8272b63d03f5f79c56e9e4120aec44ef NaN aca2eb7d00ea1a7b8ebd4e68314663af 6560211a19b47992c3666cc44a7e94c0 2017-07-21 18:25:23 NaN NaN
freq 21 NaN 527 2033 21 NaN NaN
mean NaN 1.197834 NaN NaN NaN 120.653739 19.990320
std NaN 0.705124 NaN NaN NaN 183.633928 15.806405
min NaN 1.000000 NaN NaN NaN 0.850000 0.000000
25% NaN 1.000000 NaN NaN NaN 39.900000 13.080000
50% NaN 1.000000 NaN NaN NaN 74.990000 16.260000
75% NaN 1.000000 NaN NaN NaN 134.900000 21.150000
max NaN 21.000000 NaN NaN NaN 6735.000000 409.680000
Code
df_order_payments.describe(include = "all")
order_id payment_sequential payment_type payment_installments payment_value
count 103886 103886.000000 103886 103886.000000 103886.000000
unique 99440 NaN 5 NaN NaN
top fa65dad1b0e818e3ccc5cb0e39231352 NaN credit_card NaN NaN
freq 29 NaN 76795 NaN NaN
mean NaN 1.092679 NaN 2.853349 154.100380
std NaN 0.706584 NaN 2.687051 217.494064
min NaN 1.000000 NaN 0.000000 0.000000
25% NaN 1.000000 NaN 1.000000 56.790000
50% NaN 1.000000 NaN 1.000000 100.000000
75% NaN 1.000000 NaN 4.000000 171.837500
max NaN 29.000000 NaN 24.000000 13664.080000
Code
df_order_reviews.describe(include = "all")
review_id order_id review_score review_comment_title review_comment_message review_creation_date review_answer_timestamp
count 99224 99224 99224.000000 11568 40977 99224 99224
unique 98410 98673 NaN 4527 36159 636 98248
top 7b606b0d57b078384f0b58eac1d41d78 c88b1d1b157a9999ce368f218a407141 NaN Recomendo Muito bom 2017-12-19 00:00:00 2017-06-15 23:21:05
freq 3 3 NaN 423 230 463 4
mean NaN NaN 4.086421 NaN NaN NaN NaN
std NaN NaN 1.347579 NaN NaN NaN NaN
min NaN NaN 1.000000 NaN NaN NaN NaN
25% NaN NaN 4.000000 NaN NaN NaN NaN
50% NaN NaN 5.000000 NaN NaN NaN NaN
75% NaN NaN 5.000000 NaN NaN NaN NaN
max NaN NaN 5.000000 NaN NaN NaN NaN
Code
df_orders.describe(include = "all")
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
count 99441 99441 99441 99441 99281 97658 96476 99441
unique 99441 99441 8 98875 90733 81018 95664 459
top e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2018-04-11 10:48:14 2018-02-27 04:31:10 2018-05-09 15:48:00 2018-05-08 23:38:46 2017-12-20 00:00:00
freq 1 1 96478 3 9 47 3 522
Code
df_products.describe(include = "all")
product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
count 32951 32341 32341.000000 32341.000000 32341.000000 32949.000000 32949.000000 32949.000000 32949.000000
unique 32951 73 NaN NaN NaN NaN NaN NaN NaN
top 1e9e8ef04dbcff4541ed26657ea517e5 cama_mesa_banho NaN NaN NaN NaN NaN NaN NaN
freq 1 3029 NaN NaN NaN NaN NaN NaN NaN
mean NaN NaN 48.476949 771.495285 2.188986 2276.472488 30.815078 16.937661 23.196728
std NaN NaN 10.245741 635.115225 1.736766 4282.038731 16.914458 13.637554 12.079047
min NaN NaN 5.000000 4.000000 1.000000 0.000000 7.000000 2.000000 6.000000
25% NaN NaN 42.000000 339.000000 1.000000 300.000000 18.000000 8.000000 15.000000
50% NaN NaN 51.000000 595.000000 1.000000 700.000000 25.000000 13.000000 20.000000
75% NaN NaN 57.000000 972.000000 3.000000 1900.000000 38.000000 21.000000 30.000000
max NaN NaN 76.000000 3992.000000 20.000000 40425.000000 105.000000 105.000000 118.000000
Code
df_sellers.describe(include = "all")
seller_id seller_zip_code_prefix seller_city seller_state
count 3095 3095.000000 3095 3095
unique 3095 NaN 611 23
top 3442f8959a84dea7ee197c632cb2df15 NaN sao paulo SP
freq 1 NaN 694 1849
mean NaN 32291.059451 NaN NaN
std NaN 32713.453830 NaN NaN
min NaN 1001.000000 NaN NaN
25% NaN 7093.500000 NaN NaN
50% NaN 14940.000000 NaN NaN
75% NaN 64552.500000 NaN NaN
max NaN 99730.000000 NaN NaN
Code
df_product_category_name.describe(include = "all")
product_category_name product_category_name_english
count 71 71
unique 71 71
top beleza_saude health_beauty
freq 1 1
Import SQLAlchemy and load the JupySQL extension to run some exploratory SQL queries against the SQLite database
Code
from sqlalchemy import create_engine

# 1. Load the extension first
%reload_ext sql

# 2. Define your path
db_path = r'C:\Users\desja\Dropbox\Backup of Jumper Laptop files continuous\Documents\Project Revamp Portfolio\Brazilian Ecommerce Dataset\olist.sqlite'

# 3. Create the engine (safest way to handle paths with spaces)
engine = create_engine(f"sqlite:///{db_path}")

# 4. Connect JupySQL to that engine
%sql engine

# 5. List all tables in the database to see their exact names
%sql SELECT name FROM sqlite_master WHERE type='table';
name
product_category_name_translation
sellers
customers
geolocation
order_items
order_payments
order_reviews
orders
products
leads_qualified
Truncated to displaylimit of 10.
It turns out that order IDs in the orders table are all unique, which is good. Order IDs in the order_items table are sometimes duplicated because the customer ordered more than one item in the same order, which also makes sense. See below:
Code
%%sql
SELECT *
FROM order_items
WHERE order_id IN (
    -- Subquery: Find only the order_ids that appear more than once
    SELECT order_id
    FROM order_items
    GROUP BY order_id
    HAVING COUNT(order_id) > 1
)
ORDER BY order_id;
order_id order_item_id product_id seller_id shipping_limit_date price freight_value
0008288aa423d2a3f00fcb17cd7d8719 1 368c6c730842d78016ad823897a372db 1f50f920176fa81dab994f9023523100 2018-02-21 02:55:52 49.9 13.37
0008288aa423d2a3f00fcb17cd7d8719 2 368c6c730842d78016ad823897a372db 1f50f920176fa81dab994f9023523100 2018-02-21 02:55:52 49.9 13.37
00143d0f86d6fbd9f9b38ab440ac16f5 1 e95ee6822b66ac6058e2e4aff656071a a17f621c590ea0fab3d5d883e1630ec6 2017-10-20 16:07:52 21.33 15.1
00143d0f86d6fbd9f9b38ab440ac16f5 2 e95ee6822b66ac6058e2e4aff656071a a17f621c590ea0fab3d5d883e1630ec6 2017-10-20 16:07:52 21.33 15.1
00143d0f86d6fbd9f9b38ab440ac16f5 3 e95ee6822b66ac6058e2e4aff656071a a17f621c590ea0fab3d5d883e1630ec6 2017-10-20 16:07:52 21.33 15.1
001ab0a7578dd66cd4b0a71f5b6e1e41 1 0b0172eb0fd18479d29c3bc122c058c2 5656537e588803a555b8eb41f07a944b 2018-01-04 02:33:42 24.89 17.63
001ab0a7578dd66cd4b0a71f5b6e1e41 2 0b0172eb0fd18479d29c3bc122c058c2 5656537e588803a555b8eb41f07a944b 2018-01-04 02:33:42 24.89 17.63
001ab0a7578dd66cd4b0a71f5b6e1e41 3 0b0172eb0fd18479d29c3bc122c058c2 5656537e588803a555b8eb41f07a944b 2018-01-04 02:33:42 24.89 17.63
001d8f0e34a38c37f7dba2a37d4eba8b 1 e67307ff0f15ade43fcb6e670be7a74c f4aba7c0bca51484c30ab7bdc34bcdd1 2017-05-18 17:35:11 18.99 7.78
001d8f0e34a38c37f7dba2a37d4eba8b 2 e67307ff0f15ade43fcb6e670be7a74c f4aba7c0bca51484c30ab7bdc34bcdd1 2017-05-18 17:35:11 18.99 7.78
Truncated to displaylimit of 10.
Below it is evident that the customer_unique_id helps identify repeat customers, which is great for marketing analysis.
Code
%%sql
SELECT *
FROM customers
WHERE customer_unique_id IN (
    SELECT customer_unique_id
    FROM customers
    GROUP BY customer_unique_id
    HAVING COUNT(*) > 1
)
ORDER BY customer_unique_id;
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
1afe8a9c67eec3516c09a8bdcc539090 00172711b30d52eea8b313a7f2cced02 45200 jequie BA
24b0e2bd287e47d54d193e7bbb51103f 00172711b30d52eea8b313a7f2cced02 45200 jequie BA
1b4a75b3478138e99902678254b260f4 004288347e5e88a27ded2bb23747066c 26220 nova iguacu RJ
f6efe5d5c7b85e12355f9d5c3db46da2 004288347e5e88a27ded2bb23747066c 26220 nova iguacu RJ
49cf243e0d353cd418ca77868e24a670 004b45ec5c64187465168251cd1c9c2f 57055 maceio AL
d95f60d70d9ea9a7fe37c53c931940bb 004b45ec5c64187465168251cd1c9c2f 57035 maceio AL
8ac44e9c15d396b8c3c7cbab0fff4536 0058f300f57d7b93c477a131a59b36c3 41370 salvador BA
f530197ea86ced9488a03d055e118ebf 0058f300f57d7b93c477a131a59b36c3 40731 salvador BA
876356df457f952458a764348e1858bc 00a39521eb40f7012db50455bf083460 72595 brasilia DF
cbb68c721ba9ddb30d8a490cc1897fa1 00a39521eb40f7012db50455bf083460 72595 brasilia DF
Truncated to displaylimit of 10.

Save updated CSV files to local disk
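
The export code itself is not shown here; below is a minimal sketch of how the cleaned dataframes could be written back out. The output_path and the "_cleaned" suffix are placeholders, not the actual project paths.
Code
#Sketch: write each cleaned dataframe back to disk (output_path is a placeholder)
output_path = "C:/path/to/output"  #placeholder directory
for name, data_frame in df_dict.items():
    data_frame.to_csv(f"{output_path}/{name}_cleaned.csv", index=False)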

Data Analysis

Let’s find the top-selling product categories by item count
Code
%%sql 
SELECT COUNT(*) AS ordercount, p.product_category_name, t.product_category_name_english
FROM order_items o
LEFT JOIN products p ON p.product_id = o.product_id
LEFT JOIN product_category_name_translation t ON t.product_category_name = p.product_category_name
GROUP BY p.product_category_name
ORDER BY ordercount DESC
LIMIT 10
ordercount product_category_name product_category_name_english
11115 cama_mesa_banho bed_bath_table
9670 beleza_saude health_beauty
8641 esporte_lazer sports_leisure
8334 moveis_decoracao furniture_decor
7827 informatica_acessorios computers_accessories
6964 utilidades_domesticas housewares
5991 relogios_presentes watches_gifts
4545 telefonia telephony
4347 ferramentas_jardim garden_tools
4235 automotivo auto
Truncated to displaylimit of 10.
Next, let’s rank the product categories by total sales
Code
%%sql 
SELECT SUM(o.price) AS TotalSales, p.product_category_name, t.product_category_name_english
FROM order_items o
LEFT JOIN products p ON p.product_id = o.product_id
LEFT JOIN product_category_name_translation t ON t.product_category_name = p.product_category_name
GROUP BY p.product_category_name
ORDER BY TotalSales DESC
LIMIT 10
TotalSales product_category_name product_category_name_english
1258681.34 beleza_saude health_beauty
1205005.68 relogios_presentes watches_gifts
1036988.68 cama_mesa_banho bed_bath_table
988048.97 esporte_lazer sports_leisure
911954.32 informatica_acessorios computers_accessories
729762.49 moveis_decoracao furniture_decor
635290.85 cool_stuff cool_stuff
632248.66 utilidades_domesticas housewares
592720.11 automotivo auto
485256.46 ferramentas_jardim garden_tools
Truncated to displaylimit of 10.
As the list shows, health_beauty and watches_gifts edge out bed_bath_table in total sales even though bed_bath_table is the clear winner by item count. Let’s see which categories are the most expensive on average, to see how that relates to the list of top categories by sales.
Code
%%sql 
SELECT ROUND(AVG(o.price),2) AS average_price, p.product_category_name, t.product_category_name_english
FROM order_items o
LEFT JOIN products p ON p.product_id = o.product_id
LEFT JOIN product_category_name_translation t ON t.product_category_name = p.product_category_name
GROUP BY p.product_category_name
ORDER BY average_price DESC
LIMIT 10
average_price product_category_name product_category_name_english
1098.34 pcs computers
624.29 portateis_casa_forno_e_cafe small_appliances_home_oven_and_coffee
476.12 eletrodomesticos_2 home_appliances_2
342.12 agro_industria_e_comercio agro_industry_and_commerce
281.62 instrumentos_musicais musical_instruments
280.78 eletroportateis small_appliances
264.57 portateis_cozinha_e_preparadores_de_alimentos None
225.69 telefonia_fixa fixed_telephony
208.99 construcao_ferramentas_seguranca construction_tools_safety
201.14 relogios_presentes watches_gifts
Truncated to displaylimit of 10.
It appears that Olist sells many categories whose items are, on average, quite expensive compared with the best sellers by item count and total sales. Let’s take a visual look at monthly sales (aggregated across years) for the store as a whole to see whether there is a particular time of year when most sales are made. This could inform marketing decisions. Since marketing can be expensive (TV commercials, radio spots, and newspaper space are not trivial costs), strategic placement matters for effectiveness. By combining the best-selling categories with the timeframes when people purchase from the online store most frequently, Olist could maximize the return on its advertising.
Code
%matplotlib inline
Code
df_orders["month_name"] = df_orders["order_purchase_timestamp"].dt.month_name()
Code
# 1. MERGE TABLES
# We join them on 'order_id' so every item gets a timestamp
df_merged = pd.merge(df_order_items, df_orders, on='order_id')

# 2. CONVERT TO DATETIME
# Ensure the timestamp is an actual date object, not just text
df_merged['order_purchase_timestamp'] = pd.to_datetime(df_merged['order_purchase_timestamp'])

# 3. EXTRACT MONTH DATA
# We create a Number column for sorting (1=Jan) and a Name column for the chart labels
df_merged['month_number'] = df_merged['order_purchase_timestamp'].dt.month
df_merged['month_name'] = df_merged['order_purchase_timestamp'].dt.month_name()

# 4. GROUP AND AGGREGATE
# We sum the prices for all items falling in "January", "February", etc.
monthly_sales = df_merged.groupby(['month_number', 'month_name'])['price'].sum().reset_index()

# 5. SORT
# Important: Sort by 'month_number' so the chart reads Jan -> Dec
monthly_sales = monthly_sales.sort_values('month_number')

# 6. PLOT THE BAR CHART
plt.figure(figsize=(12, 6))
plt.bar(monthly_sales['month_name'], monthly_sales['price'], color='skyblue')

plt.title('Total Sales by Month (All Years Aggregated)')
plt.xlabel('Month')
plt.ylabel('Total Sales ($)')
plt.xticks(rotation=45) # Tilt the labels so they fit better

plt.show()

Contrary to what I expected, there is no spike in sales around December. Since I am from the United States, where November and December are special months because of Thanksgiving and Christmas, that time of year may simply be less of a sales season in Brazil.
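One caveat worth checking before acting on this chart: if the dataset does not cover every month of every year equally, aggregating across years can distort the monthly pattern. A minimal sketch to inspect sales by year and month separately:
Code
#Sketch: break sales out by year and month to check whether the aggregate is skewed by uneven coverage
df_merged['year'] = df_merged['order_purchase_timestamp'].dt.year
yearly_monthly = df_merged.groupby(['year', 'month_number'])['price'].sum().unstack('year')
print(yearly_monthly)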
Code
# ---------------------------------------------------------
# STEP 0: RESET THE DATAFRAME
# ---------------------------------------------------------
# We re-create df_merged from scratch so the columns don't get messed up by re-running
df_merged = pd.merge(df_order_items, df_orders, on='order_id')
df_merged['order_purchase_timestamp'] = pd.to_datetime(df_merged['order_purchase_timestamp'])
df_merged['month_name'] = df_merged['order_purchase_timestamp'].dt.month_name()

# ---------------------------------------------------------
# STEP 1: Add Product Names
# ---------------------------------------------------------
# Add Product info (to get the category name in Portuguese)
df_merged = pd.merge(df_merged, df_products, on='product_id')

# Add English Translation
df_merged = pd.merge(df_merged, df_product_category_name, on='product_category_name')

# ---------------------------------------------------------
# STEP 2: Filter & Prepare
# ---------------------------------------------------------
# Filter for Top 10 categories
top_categories = df_merged['product_category_name_english'].value_counts().head(10).index
df_top = df_merged[df_merged['product_category_name_english'].isin(top_categories)]

# Create Contingency Table
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']

contingency_table = pd.crosstab(df_top['month_name'], df_top['product_category_name_english'])
contingency_table = contingency_table.reindex(month_order)

# ---------------------------------------------------------
# STEP 3: Statistical Test & Plot
# ---------------------------------------------------------
chi2, p, dof, expected = chi2_contingency(contingency_table)

print(f"Chi-Square P-Value: {p}")
if p < 0.05:
    print("CONCLUSION: Statistically significant seasonal patterns detected!")
else:
    print("CONCLUSION: No significant patterns found.")

plt.figure(figsize=(12, 8))

# Normalize by column (axis=0) to see peak months
heatmap_data = contingency_table.div(contingency_table.max(axis=0), axis=1)

# Plotting
sb.heatmap(heatmap_data, cmap="YlGnBu", annot=False, linewidths=.5)
plt.title('Seasonality Heatmap: When does each category sell the most?')
plt.xlabel('Product Category')
plt.ylabel('Month')
plt.yticks(rotation=0) 

plt.show()
Chi-Square P-Value: 9.182505419196234e-297
CONCLUSION: Statistically significant seasonal patterns detected!

The deeper statistical analysis confirms a statistically significant seasonal pattern, with March through August (inclusive) being the months with the greatest sales. Next I will run a similar analysis, but with the top 10 categories selected by aggregated sales figures rather than by item counts.
Code
# --- Data Preparation ---
# Re-merge base dataframes to ensure a clean starting point (avoids duplicate columns from multiple runs)
df_merged = pd.merge(df_order_items, df_orders, on='order_id')
df_merged['order_purchase_timestamp'] = pd.to_datetime(df_merged['order_purchase_timestamp'])
df_merged['month_name'] = df_merged['order_purchase_timestamp'].dt.month_name()

# Merge product metadata and English translations
df_merged = pd.merge(df_merged, df_products, on='product_id')
df_merged = pd.merge(df_merged, df_product_category_name, on='product_category_name')

# --- Feature Engineering: Top Categories by Revenue ---
# Calculate total revenue per category to identify the top 10 high-value segments
category_revenue = df_merged.groupby('product_category_name_english')['price'].sum()
top_revenue_categories = category_revenue.sort_values(ascending=False).head(10).index

# Filter the main dataset to include only transactions for these top 10 categories
df_top_revenue = df_merged[df_merged['product_category_name_english'].isin(top_revenue_categories)]

# --- Statistical Analysis: Chi-Square Test ---
# Define chronological month order for proper sorting
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 
               'July', 'August', 'September', 'October', 'November', 'December']

# Create a contingency table (Month vs. Category)
contingency_table = pd.crosstab(df_top_revenue['month_name'], df_top_revenue['product_category_name_english'])
contingency_table = contingency_table.reindex(month_order)

# Perform Chi-Square Test of Independence to detect seasonality
chi2, p, dof, expected = chi2_contingency(contingency_table)
print(f"Chi-Square Test P-Value: {p}")

# --- Visualization: Seasonality Heatmap ---
plt.figure(figsize=(12, 8))

# Normalize by column (axis=0) to highlight peak sales months relative to each category
heatmap_data = contingency_table.div(contingency_table.max(axis=0), axis=1)

sb.heatmap(heatmap_data, cmap="YlGnBu", annot=False, linewidths=.5)
plt.title('Seasonality Heatmap: Peak Revenue Months (Top 10 Categories)')
plt.xlabel('Product Category')
plt.ylabel('Month')
plt.yticks(rotation=0)

plt.show()
Chi-Square Test P-Value: 5.212916281064613e-302

This confirms that the patterns for sales and item counts are very similar, though not identical. The first heat map (item counts) could inform inventory decisions for the various product categories, while the second (based on the top categories by revenue) could inform how advertising spend is allocated across categories. For example, most items should be ordered and stocked in September and October (when sales and order counts are lowest), and advertising should be heaviest in March and August, corresponding to natural purchasing patterns (except for garden tools, which sell best in November in Brazil).
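Note that the contingency table above still counts line items within the revenue-ranked categories. For a strictly revenue-weighted view of the heatmap (not the chi-square test, which expects counts), the crosstab could aggregate price directly — a minimal sketch:
Code
#Sketch: revenue-weighted month x category table for the heatmap
revenue_table = pd.crosstab(df_top_revenue['month_name'], df_top_revenue['product_category_name_english'],
                            values=df_top_revenue['price'], aggfunc='sum').reindex(month_order)
revenue_heatmap = revenue_table.div(revenue_table.max(axis=0), axis=1)

plt.figure(figsize=(12, 8))
sb.heatmap(revenue_heatmap, cmap="YlGnBu", linewidths=.5)
plt.title('Seasonality Heatmap: Revenue Share by Month (Top 10 Categories)')
plt.show()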
Code
# --- STEP 1: PREPARE DATA ---
# We need to merge with the 'customers' table to get the 'customer_unique_id'
# (df_merged currently has 'customer_id', which is different for every order)
# Merge unique IDs into main dataframe
df_segmentation = pd.merge(df_merged, df_customers, on='customer_id')
Code
# --- STEP 2: FEATURE ENGINEERING (RFM) ---
# We calculate Recency, Frequency, and Monetary value for each unique customer
max_date = df_segmentation['order_purchase_timestamp'].max()

rfm_data = df_segmentation.groupby('customer_unique_id').agg({
    'order_purchase_timestamp': lambda x: (max_date - x.max()).days, # Recency (days since last order)
    'order_id': 'count',                                             # Frequency (total items bought)
    'price': 'sum'                                                   # Monetary (total spent)
}).reset_index()

# Rename columns for clarity
rfm_data.columns = ['customer_unique_id', 'Recency', 'Frequency', 'Monetary']
Code
# --- STEP 3: MACHINE LEARNING (K-MEANS) ---
# 1. Preprocessing: We use StandardScaler to make Recency (0-300) comparable to Frequency (1-10).
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_data[['Recency', 'Frequency', 'Monetary']])

# 2. Apply K-Means Clustering
# We will ask the model to find 4 distinct groups of customers
kmeans = KMeans(n_clusters=4, random_state=42)
rfm_data['Cluster'] = kmeans.fit_predict(rfm_scaled)
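The choice of 4 clusters is an assumption; a quick elbow check (a minimal sketch) can be used to sanity-check that number before committing to it:
Code
#Sketch: elbow method to sanity-check the number of clusters
inertias = []
k_values = range(1, 9)
for k in k_values:
    inertias.append(KMeans(n_clusters=k, random_state=42, n_init=10).fit(rfm_scaled).inertia_)

plt.figure(figsize=(8, 4))
plt.plot(list(k_values), inertias, marker='o')
plt.title('Elbow Plot: K-Means Inertia vs. Number of Clusters')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()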
Code
# --- STEP 4: VISUALIZATION ---
# Plot 'Recency' vs 'Monetary' to see the segments
plt.figure(figsize=(12, 8))

sb.scatterplot(
    data=rfm_data, 
    x='Recency', 
    y='Monetary', 
    hue='Cluster',      # Color = Segment
    size='Frequency',   # Size = How often they buy (The missing metric!)
    sizes=(20, 400),    # Control the range of dot sizes (min, max)
    palette='viridis',
    alpha=0.6
)

plt.title('Customer Segments: Recency vs. Monetary (Size = Frequency)')
plt.xlabel('Recency (Days since last purchase)')
plt.ylabel('Monetary (Total Spend)')
plt.ylim(0, 3000) # Limit Y-axis to keep the chart readable
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()

plt.show()

Code
# --- STEP 5: INTERPRETATION ---
# Show the average stats for each cluster
print("Cluster Profiles (Who are these people?):")
print(rfm_data.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean())
Cluster Profiles (Who are these people?):
            Recency  Frequency    Monetary
Cluster                                   
0         90.465564   1.092653  117.935052
1        462.831128   1.093289  118.961608
2        231.826464   2.881345  640.670720
3        258.580630   1.091414  114.351480
From this plot we can see that the vast majority of customers spend below about 800 in total and purchase only once. We can also see a group of customers who have ordered very recently. The plot shows the highest spenders (top of the Y-axis), the new potential customers who have made purchases recently, a middle group that may need attention to draw them back to the online store, and a final group (far right on the X-axis) that seems uninterested in shopping regularly on Olist. The customers who make large purchases also tend to be the more frequent customers.
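A quick correlation check (a minimal sketch) can quantify the impression that the big spenders also tend to be the more frequent buyers:
Code
#Sketch: correlations between the RFM features
print(rfm_data[['Recency', 'Frequency', 'Monetary']].corr())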
Code
%%sql
SELECT 
    c.customer_city,
    c.customer_state,
    COUNT(*) as num_vip_customers
FROM customers c
JOIN (
    SELECT 
        c2.customer_unique_id
    FROM orders o
    JOIN customers c2 ON o.customer_id = c2.customer_id
    GROUP BY c2.customer_unique_id
    -- We use DISTINCT to ensure we are counting 'Visits', not 'Items'
    HAVING COUNT(DISTINCT o.order_id) >= 5
) as vip_shoppers ON c.customer_unique_id = vip_shoppers.customer_unique_id
GROUP BY c.customer_city, c.customer_state
ORDER BY num_vip_customers DESC
LIMIT 5;
customer_city customer_state num_vip_customers
sao paulo SP 32
praia grande SP 9
ituiutaba MG 7
recife PE 7
santos SP 7

The Impact of Delivery Time on Reviews

Code
# --- STEP 1: PREPARE THE DATA ---
df_satisfaction = pd.merge(df_orders, df_order_reviews, on='order_id')

# Convert dates to datetime objects (if not already)
df_satisfaction['order_purchase_timestamp'] = pd.to_datetime(df_satisfaction['order_purchase_timestamp'])
df_satisfaction['order_delivered_customer_date'] = pd.to_datetime(df_satisfaction['order_delivered_customer_date'])
Code
# --- STEP 2: NUMPY FEATURE ENGINEERING ---

# We use Numpy to calculate the 'Actual Delivery Time' in days.
# We divide by np.timedelta64(1, 'D') to convert the time difference into a simple number (float).

df_satisfaction['delivery_days'] = (
    df_satisfaction['order_delivered_customer_date'] - df_satisfaction['order_purchase_timestamp']
) / np.timedelta64(1, 'D')

# Drop rows where delivery hasn't happened yet (NaNs) or outliers (> 60 days) for a cleaner chart
df_clean = df_satisfaction.dropna(subset=['delivery_days', 'review_score'])
df_clean = df_clean[df_clean['delivery_days'] < 60]
Code
# --- STEP 3: STATISTICAL INFERENCE (Simple Linear Regression) ---
# Question: Can Delivery Days predict Review Score?
# Formula: Review Score = Intercept + (Coefficient * Delivery Days)

# Define X (Predictor) and y (Target)
X = df_clean['delivery_days']
y = df_clean['review_score']

# Statsmodels requires us to manually add a constant (Intercept) to the equation
X = sm.add_constant(X)

# Fit the Ordinary Least Squares (OLS) model
model = sm.OLS(y, X).fit()

# Print the professional summary table
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           review_score   R-squared:                       0.117
Model:                            OLS   Adj. R-squared:                  0.117
Method:                 Least Squares   F-statistic:                 1.277e+04
Date:                Wed, 03 Dec 2025   Prob (F-statistic):               0.00
Time:                        15:50:49   Log-Likelihood:            -1.5398e+05
No. Observations:               96067   AIC:                         3.080e+05
Df Residuals:                   96065   BIC:                         3.080e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             4.8083      0.007    695.616      0.000       4.795       4.822
delivery_days    -0.0526      0.000   -113.007      0.000      -0.053      -0.052
==============================================================================
Omnibus:                    20048.602   Durbin-Watson:                   2.007
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            35689.268
Skew:                          -1.361   Prob(JB):                         0.00
Kurtosis:                       4.228   Cond. No.                         26.6
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Code
# --- STEP 4: VISUALIZATION ---
plt.figure(figsize=(10, 6))
sb.regplot(
    x='delivery_days', 
    y='review_score', 
    data=df_clean, 
    scatter_kws={'alpha': 0.05, 'color': 'gray'},
    line_kws={'color': 'red'}
)
plt.title('Impact of Delivery Time on Customer Review Scores')
plt.xlabel('Days to Deliver')
plt.ylabel('Review Score (1-5)')
plt.show()

The obvious trend is that as delivery time increases, the rating drops, but the more important detail is the slope of the line: roughly half a rating point is lost for every 10 additional days of delivery (a coefficient of about -0.053 per day). This slope is statistically significant, with a p-value far below 0.05.
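To translate the slope into concrete expectations, the fitted model can be evaluated at a few example delivery times — a minimal sketch using the model object fit above:
Code
#Sketch: predicted average review score at a few example delivery times
example_days = pd.DataFrame({'delivery_days': [5, 10, 20, 30]})
example_days = sm.add_constant(example_days)
print(model.predict(example_days))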
In conclusion, the best months to stock merchandise are September and October, and the best months to advertise products are March and August. The best city and state to advertise in is São Paulo, São Paulo, and to protect review scores, delivery within 10 days or less should be the target.