GET GUARANTEED SATISFACTION OR MONEY BACK UNDER ICT706 DATA ANALYTICS ASSIGNMENT HELP SERVICES OF EXPERTSMINDS.COM - ORDER TODAY NEW COPY OF THIS ASSIGNMENT!
Research Project:
In this research project you will undertake a data analytics approach to solve a set of business problems that require the use of appropriately selected data processing and mining approaches.
Answer:
In [1]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import seaborn as sns
from scipy import stats
from scipy.stats import kurtosis
from scipy.stats import skew
from sklearn.model_selection import train_test_split
Part A: Load and Clean Data
Part B: Data Exploration
Part C: Predicting Spending Levels
Part D: Predicting Big Spenders
Part E: Business Recommendations
Write Python code to load your dataset into a Pandas DataFrame called 'sales'.
In [2]:
train_data = pd.read_csv('Train_UWu5bXk.csv',header=0)
test_data = pd.read_csv('Test_u94Q5KV.csv',header=0)
In [3]:
train_data.head(10)
Out[3]:
In [4]:
train_data.info()
In [5]:
train_data.describe()
Out[5]:
In [6]:
test_data['Item_Outlet_Sales'] = 0
In [7]:
df = pd.concat([train_data,test_data])
In [8]:
df.head(10)
Out[8]:
In [9]:
df.shape
Out[9]:
(14204, 12)
In [10]:
df.isnull().sum(axis = 0)
Out[10]:
In [11]:
# Fat content
print(df.Item_Fat_Content.unique())
df.loc[df.Item_Fat_Content.isin(['LF','low fat']), 'Item_Fat_Content'] = 'Low Fat'
df.loc[df.Item_Fat_Content.isin(['reg']), 'Item_Fat_Content'] = 'Regular'
print(df.Item_Fat_Content.value_counts())
print(df.Item_Type.unique())
print(df.groupby('Item_Type')['Item_Fat_Content'].count())
df.loc[df.Item_Type.isin(['Health and Hygiene','Household','Others']), 'Item_Fat_Content'] = 'None'
print(df.Item_Fat_Content.value_counts())
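The label clean-up above can also be written as a single lookup. The following is a minimal sketch on toy data, assuming the only variants are the ones printed by `unique()`; the `canonical` dictionary and the `fat` series are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy column with the same kinds of spelling variants as Item_Fat_Content.
fat = pd.Series(['Low Fat', 'LF', 'low fat', 'Regular', 'reg'])

# Hypothetical lookup table: each variant maps to its canonical label.
canonical = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}

# replace() leaves values that are not keys of the dictionary untouched.
fat_clean = fat.replace(canonical)
print(fat_clean.value_counts())
```

Keeping the variants in one dictionary makes it easier to add a new spelling later without writing another `df.loc[...]` line.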
In [12]:
sns.boxplot(x=df.Item_Type, y=df.Item_Weight)  # recent seaborn versions require x/y as keywords
plt.xticks(rotation=45)
plt.show()
In [13]:
sns.boxplot(x=df.Outlet_Identifier, y=df.Item_Weight)  # recent seaborn versions require x/y as keywords
plt.xticks(rotation=45)
plt.show()
In [14]:
## OUT027 and OUT019 do not have any item weights associated with them
## Fill missing values in item weight with the mean for that item identifier
# Select Item_Weight explicitly: mean() over text columns fails in recent pandas
weights_mean = df.groupby('Item_Identifier', as_index=False)['Item_Weight'].mean()
print(weights_mean.head(5))
In [15]:
# Fill each missing Item_Weight with the mean weight of the same Item_Identifier.
# transform('mean') returns one value per row, so fillna receives scalars rather than Series.
df['Item_Weight'] = df['Item_Weight'].fillna(
    df.groupby('Item_Identifier')['Item_Weight'].transform('mean')
)
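To make the imputation step concrete, here is a small sketch on made-up data, assuming each item has at least one recorded weight somewhere in the combined frame. The `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two items, each with one missing weight.
toy = pd.DataFrame({
    'Item_Identifier': ['A', 'A', 'B', 'B', 'B'],
    'Item_Weight': [10.0, np.nan, 4.0, 6.0, np.nan],
})

# Per-row mean of the item's known weights (NaN values are skipped by mean).
group_mean = toy.groupby('Item_Identifier')['Item_Weight'].transform('mean')

# Only the missing entries are replaced; recorded weights stay as they are.
toy['Item_Weight'] = toy['Item_Weight'].fillna(group_mean)
print(toy)
```

An item whose weight is missing in every row would still be NaN after this step, so a check with `isnull().sum()` afterwards remains useful.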
In [16]:
df.isnull().sum(axis = 0)
df.Item_Weight = df.Item_Weight.astype(float)
In [17]:
sns.boxplot(x=df.Item_Type, y=df.Item_Weight)
plt.xticks(rotation=45)
plt.show()
In [18]:
df.info()
In [19]:
sns.boxplot(x=df.Outlet_Identifier, y=df.Item_Weight)
plt.xticks(rotation=45)
plt.show()
In [20]:
df['year'] = 2013 - df.Outlet_Establishment_Year
df = df.drop(['Outlet_Establishment_Year'],axis=1)
In [21]:
df.columns
Out[21]:
Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
'Item_Type', 'Item_MRP', 'Outlet_Identifier', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales', 'year'],
dtype='object')
In [22]:
sns.kdeplot(df.Item_MRP, fill=True)  # 'fill' replaces the deprecated 'shade' argument
plt.axvline(x=70,color="blue")
plt.axvline(x=137,color="blue")
plt.axvline(x=210,color="blue")
Out[22]:
<matplotlib.lines.Line2D at 0x2e50ee46630>
In [23]:
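### There are four distinct price ranges. Let's introduce a variable MRP_level to account for that.
conditions = [
(df['Item_MRP'] < 70),
(df['Item_MRP'] < 137),
(df['Item_MRP'] < 210),
(df['Item_MRP'] >= 210)]  # >= so that a price of exactly 210 is not left unlabelled
choices = ['Low', 'Medium', 'High','Very high']
# np.select uses the first matching condition; a string default keeps the result dtype consistent
df['MRP_level'] = np.select(conditions, choices, default='Unknown')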
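The same four price bands can be expressed with `pd.cut`, which states the bin edges once instead of as a chain of conditions. This is a sketch of an alternative, not the notebook's method; the sample prices are made up, and the edges 70, 137 and 210 are taken from the vertical lines drawn on the density plot.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including values that sit exactly on a bin edge.
mrp = pd.Series([31.29, 69.99, 70.0, 136.5, 137.0, 209.9, 210.0, 266.9])

bins = [-np.inf, 70, 137, 210, np.inf]
labels = ['Low', 'Medium', 'High', 'Very high']

# right=False makes every bin closed on the left: [70, 137), [137, 210), ...
mrp_level = pd.cut(mrp, bins=bins, labels=labels, right=False)
print(mrp_level.tolist())
```

With `right=False`, a price of exactly 70 falls in 'Medium' and exactly 210 in 'Very high', which matches reading the conditions above from top to bottom.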
In [24]:
df.MRP_level.head(10)
Out[24]:
0 Very high
1 Low
2 High
3 High
4 Low
5 Low
6 Low
7 Medium
8 Medium
9 High
Name: MRP_level, dtype: object
In [25]:
### Missing values in outlet_size
df.Outlet_Identifier.value_counts()
Out[25]:
OUT027 1559
OUT013 1553
OUT035 1550
OUT046 1550
OUT049 1550
OUT045 1548
OUT018 1546
OUT017 1543
OUT010 925
OUT019 880
Name: Outlet_Identifier, dtype: int64
In [26]:
### Outlets 10 and 19 have reported far less data than the other outlets.
### Let's assume it's because they are smaller and have fewer goods to offer.
df.groupby('Outlet_Identifier').agg({'Item_Identifier' : len})
In [27]:
### From the above table it is clear that outlets 10 and 19 are smaller and hence have a lower
### number of items, as indicated by the count of item identifiers.
In [28]:
### Boxplot of Sales vs Outlet Identifier
sns.boxplot(x=train_data.Outlet_Identifier, y=train_data.Item_Outlet_Sales)
plt.xticks(rotation=45)
plt.show()
In [29]:
### Boxplot of Sales vs Outlet Type
sns.boxplot(x=train_data.Outlet_Type, y=train_data.Item_Outlet_Sales)
plt.xticks(rotation=45)
plt.show()
In [30]:
# Sales in the one type 2 supermarket appear a bit low.
# Maybe it's because it's still fairly new, having
# been founded 4 years ago.
In [31]:
### Boxplot of Sales vs Outlet Type
ax = sns.boxplot(x="Outlet_Type", y="Item_Outlet_Sales", data=train_data, hue="Outlet_Size")
ax.set_xticklabels(ax.get_xticklabels(),rotation=45)
ax.legend(loc='upper left')
plt.show()
In [32]:
df.columns
Out[32]:
Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
'Item_Type', 'Item_MRP', 'Outlet_Identifier', 'Outlet_Size',
'Outlet_Location_Type', 'Outlet_Type', 'Item_Outlet_Sales', 'year',
'MRP_level'],
dtype='object')
In [33]:
othershops = df.groupby(['Outlet_Identifier','Outlet_Type', 'Outlet_Location_Type', 'Outlet_Size']).agg({'Outlet_Size' : len})
othershops = othershops.add_suffix('_Count').reset_index()
In [34]:
### Out10 is small
df['Outlet_Size'] = np.where(df['Outlet_Identifier'] == 'OUT010', 'SMALL', df['Outlet_Size'])
In [35]:
### Boxplot of Sales vs Outlet Location Type
ax = sns.boxplot(x="Outlet_Location_Type", y="Item_Outlet_Sales", data=train_data, hue="Outlet_Size")
ax.set_xticklabels(ax.get_xticklabels(),rotation=45)
ax.legend(loc='upper left')
plt.show()
In [36]:
### Boxplot of Sales vs Item Type
ax = sns.boxplot(x="Item_Type", y="Item_Outlet_Sales", data=train_data, hue="Outlet_Size")
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax.legend(loc='upper left')
plt.show()
In [37]:
### Boxplot of Sales vs Item Type
ax = sns.boxplot(x="Item_Type", y="Item_Outlet_Sales", data=train_data, hue="Outlet_Type")
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax.legend(loc='upper left')
plt.show()
In [38]:
### Boxplot of Item Visibility vs Item Type
ax = sns.boxplot(x="Item_Type", y="Item_Visibility", data=df,hue="Outlet_Type")
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
ax.legend(loc='upper left')
plt.show()
In [39]:
### Boxplot of Item Visibility vs Outlet Identifier
ax = sns.boxplot(x="Outlet_Identifier", y="Item_Visibility", data=df)
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.show()
In [40]:
# let's have a look at the item identifiers now,
# there are way too many of them.
# keeping only the first two letters gives us three groups:
# food, drink and non-food
df['Item_class'] = df['Item_Identifier'].str[0:2]
In [41]:
df['Item_class'].value_counts()
Out[41]:
FD 10201
NC 2686
DR 1317
Name: Item_class, dtype: int64
In [42]:
### Keeping the first three letters (str[0:3]) would give a higher granularity;
### this cell keeps only the first two, so Item_Identifier now matches Item_class
df['Item_Identifier'] = df['Item_Identifier'].str[0:2]
In [43]:
df['Item_Identifier'].value_counts()
Out[43]:
FD 10201
NC 2686
DR 1317
Name: Item_Identifier, dtype: int64
In [44]:
newdf = df.select_dtypes(exclude=['object'])
In [45]:
corr = newdf.corr()
# plot the heatmap
sns.heatmap(corr,
xticklabels=corr.columns,
yticklabels=corr.columns)
In [46]:
# Scatter plot of Item_Outlet_Sales vs Item_MRP
fg = sns.FacetGrid(data=df, hue='Outlet_Type', aspect=1.61)
fg.map(plt.scatter, 'Item_Outlet_Sales', 'Item_MRP').add_legend()
In [47]:
# Scatter plot of Item_Outlet_Sales vs Item_Visibility
fg = sns.FacetGrid(data=df, hue='Outlet_Type', aspect=1.61)
fg.map(plt.scatter, 'Item_Outlet_Sales', 'Item_Visibility').add_legend()
In [48]:
# Scatter plot of Item_Outlet_Sales vs Item_Visibility, coloured by Outlet_Size
fg = sns.FacetGrid(data=df, hue='Outlet_Size', aspect=1.61)
fg.map(plt.scatter, 'Item_Outlet_Sales', 'Item_Visibility').add_legend()
In [49]:
# Scatter plot of Item_Outlet_Sales vs Item_Visibility, coloured by Outlet_Identifier
fg = sns.FacetGrid(data=df, hue='Outlet_Identifier', aspect=1.61)
fg.map(plt.scatter, 'Item_Outlet_Sales', 'Item_Visibility').add_legend()
In [50]:
### Boxplot of Sales vs Item Type
ax = sns.boxplot(x="Item_Type", y="Item_Outlet_Sales", data=df,hue="Outlet_Type")
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.show()
In [51]:
### Plenty of Outliers here. We can reduce this by dividing Item_Outlet_Sales by Item_MRP
### Boxplot of Sales vs Item Type
df['Ratio'] = df['Item_Outlet_Sales']/df['Item_MRP']
ax = sns.boxplot(x="Item_Type", y="Ratio", data=df,hue="Outlet_Type")
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.show()
In [52]:
ax = sns.barplot(x="Item_Type", y="Ratio", data=df,hue="Outlet_Type")
ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
plt.show()
In [53]:
# dividing sales by MRP does reduce the number of outliers
# and also emphasizes the differences between the different
# types of shop
df['Item_Outlet_Sales'] = df['Ratio']
df = df.drop(['Ratio'],axis=1)
In [54]:
# Let's see the ratio of supermarkets to grocery stores
df.Outlet_Type.value_counts()
# Although the outlet types are heavily imbalanced, a random forest or GBM should be able
# to deal with this given enough trees.
In [55]:
# Time to look at the data for each shop separately
def analyze_shop(shop_id):
    shopdata = df[df['Outlet_Identifier'].str.contains(shop_id)]
    # Size, location type and type have only one level within a single shop,
    # and 'year' (derived from Outlet_Establishment_Year) has zero variance,
    # so all of these columns can be dropped here
    shopdata = shopdata.drop(['Outlet_Identifier',
                              'Outlet_Size',
                              'Outlet_Location_Type',
                              'Outlet_Type',
                              'year'], axis=1)
    # Figure 1: distributions of the numeric columns
    # (histplot with kde=True replaces the deprecated distplot)
    plt.figure(1)
    plt.subplot(221)
    sns.histplot(shopdata.Item_Weight, kde=True, stat='density')
    plt.subplot(222)
    sns.histplot(shopdata.Item_Visibility, kde=True, stat='density')
    plt.subplot(223)
    sns.histplot(shopdata.Item_MRP, kde=True, stat='density')
    plt.subplot(224)
    sns.histplot(shopdata.Item_Outlet_Sales, kde=True, stat='density')
    # Figure 2: sales by item type for this shop
    plt.figure(2)
    ax = sns.boxplot(x="Item_Type", y="Item_Outlet_Sales", data=shopdata)
    ax.set_xticklabels(ax.get_xticklabels(),rotation=90)
    plt.show()
In [56]:
analyze_shop('OUT018')
In [57]:
# one hot encoding
cols = df.select_dtypes(include=["object"]).columns
df2 = pd.get_dummies(df, columns=cols, drop_first=True)
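As an illustration of what the one-hot encoding step produces, here is a minimal sketch on a made-up three-row frame. The column is named explicitly rather than selected by dtype, which is an assumption made to keep the example independent of the pandas version.

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column.
toy = pd.DataFrame({
    'Item_MRP': [249.8, 48.3, 141.6],
    'Outlet_Type': ['Supermarket Type1', 'Grocery Store', 'Supermarket Type2'],
})

# drop_first=True removes the first category in sorted order ('Grocery Store'),
# which then acts as the baseline: both dummy columns are zero for it.
encoded = pd.get_dummies(toy, columns=['Outlet_Type'], drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per categorical column avoids perfectly collinear dummy columns, which mainly matters for the linear models used later.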
In [58]:
# let's resurrect the train and test data sets
# slice from the start (not from 1) so the first training row is not dropped
new_train = df2[:train_data.shape[0]]
new_test = df2[-test_data.shape[0]:]
In [59]:
print(new_test.shape)
print(new_train.shape)
(5681, 46)
(8523, 46)
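Splitting the combined frame by position works only while the row order is unchanged. A sketch of a more defensive variant is shown below, assuming a marker column is added before concatenation; the `source` column and the toy frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the train and test files.
train = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})
test = pd.DataFrame({'x': [4, 5]})

# Tag each row with its origin before combining the two frames.
combined = pd.concat(
    [train.assign(source='train'), test.assign(source='test')]
)

# ... shared cleaning and feature engineering would happen here ...

# Recover the two sets by label rather than by row position.
train_part = combined[combined['source'] == 'train'].drop(columns='source')
test_part = combined[combined['source'] == 'test'].drop(columns=['source', 'y'])
print(train_part.shape, test_part.shape)
```

The marker column would need to be removed before one-hot encoding or modelling, as done in the last two lines.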
In [60]:
target = new_train.Item_Outlet_Sales
new_train= new_train.drop('Item_Outlet_Sales',axis=1)
In [61]:
new_test = new_test.drop('Item_Outlet_Sales',axis=1)
In [62]:
new_train.to_csv('new_train.csv', sep=',', encoding='utf-8')
new_test.to_csv('new_test.csv', sep=',', encoding='utf-8')
In [63]:
# Data scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
new_train_scaled = pd.DataFrame(scaler.fit_transform(new_train), columns=new_train.columns)
new_test_scaled = pd.DataFrame(scaler.transform(new_test), columns=new_test.columns)
In [64]:
# ensembling of different models
from sklearn.model_selection import cross_val_score
from mlxtend.regressor import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from xgboost.sklearn import XGBRegressor
from sklearn.linear_model import BayesianRidge
# 'squared_error' is the current scikit-learn name for the least-squares loss (formerly 'ls')
params = {'n_estimators': 500, 'max_depth': 4, 'min_samples_split': 2,
          'learning_rate': 0.01, 'loss': 'squared_error'}
ridge = Ridge(random_state=1)
gbreg = GradientBoostingRegressor(**params)
bayridge = BayesianRidge()
xgb = XGBRegressor()
# Stack the three base regressors and let XGBoost combine their predictions
streg = StackingRegressor(regressors=[ridge, gbreg, bayridge],
                          meta_regressor=xgb)
for clf, label in zip([ridge, gbreg, xgb, bayridge, streg],
                      ['Ridge', 'GBR', 'XGB', 'Bayesian Ridge', 'Ensemble']):
    scores = cross_val_score(clf, new_train, target, cv=10,
                             scoring='neg_mean_squared_error')
    # The scores are negated MSE values, so values closer to zero are better
    print("Neg. MSE: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
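The `neg_mean_squared_error` scorer returns negated errors, which can be hard to read. The sketch below, on synthetic data from `make_regression` rather than the sales data, shows one way to turn the scores into RMSE values; the sample size, noise level and fold count are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, used only to demonstrate the scoring convention.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Each score is -MSE for one fold, so every value is zero or negative.
scores = cross_val_score(Ridge(), X, y, cv=5, scoring='neg_mean_squared_error')

# Flip the sign and take the square root to get RMSE in the units of the target.
rmse = np.sqrt(-scores)
print("RMSE: %0.2f (+/- %0.2f)" % (rmse.mean(), rmse.std()))
```

Because the notebook's target is the sales-to-MRP ratio, an RMSE computed this way would be in ratio units, not in currency.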
In [65]:
ridge.fit(new_train,target)
prediction1 = ridge.predict(new_test)
In [66]:
gbreg.fit(new_train,target)
prediction2 = gbreg.predict(new_test)
In [67]:
xgb.fit(new_train,target)
prediction3 = xgb.predict(new_test)
In [68]:
bayridge.fit(new_train,target)
prediction4 = bayridge.predict(new_test)
In [69]:
streg.fit(new_train,target)
prediction5 = streg.predict(new_test)
In [70]:
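# Weighted blend of the four models (weights sum to 1); multiplying by Item_MRP
# converts the predicted sales/MRP ratio back to sales
prediction = (0.3*prediction1+0.3*prediction4+0.25*prediction2+0.15*prediction3)*new_test.Item_MRP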
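The blending step is a weighted average of the per-model ratio predictions, followed by a rescale with the price. A minimal numeric sketch with made-up predictions and hypothetical weights (not the 0.3/0.3/0.25/0.15 used above) is:

```python
import numpy as np

# Made-up ratio predictions from three hypothetical models for two items
# (one row per model, one column per item).
preds = np.array([
    [10.0, 20.0],
    [12.0, 18.0],
    [14.0, 22.0],
])

# Hypothetical blend weights; they should sum to 1 so the result stays on the same scale.
weights = np.array([0.5, 0.3, 0.2])
assert np.isclose(weights.sum(), 1.0)

# Weighted average across models for each item.
blended_ratio = weights @ preds

# Undo the sales / MRP transformation by multiplying with each item's price.
mrp = np.array([100.0, 50.0])
blended_sales = blended_ratio * mrp
print(blended_ratio, blended_sales)
```

The weights here are fixed by hand; choosing them from validation performance would be a separate step that the original notebook does not show.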
In [71]:
# .copy() gives an independent frame, so adding a column does not raise SettingWithCopyWarning
results = test_data[['Item_Identifier','Outlet_Identifier']].copy()
In [72]:
results.loc[:,'Item_Outlet_Sales'] = prediction
In [73]:
#results.to_csv('submission.csv', sep=',', encoding='utf-8', index=False)