Can the house prices of King County be predicted?

  • This dataset contains house sale prices for King County, which includes Seattle. It covers homes sold between May 2014 and May 2015. The dataset is made available on Kaggle and can be downloaded from the following link.

https://www.kaggle.com/harlfoxem/housesalesprediction

  • We can do the following analysis using PySpark and its MLlib library.

Exploratory Analysis

  • Plot a distribution and box plot for the price variable. Find out if there are any outliers and list them.
  • Find out which variables are highly correlated with 'price'.
  • Find out which zip codes have the highest median house 'price'. Do a bar plot to depict the top 10.
  • Find out if there are any missing values in the dataset and describe your strategy for imputing them.
  • Determine whether any variables need feature engineering or transformation before they can be used to predict prices. Explain what transformations you will apply.

Model Building

  • Build a regression model to predict the prices.

Evaluation

  • Calculate the RMSE and R-Squared value on the test data set.
  • Build the linear model with L1 and L2 regularization. Do a grid search to find optimal values for the hyperparameters.

    • Lasso
    • Ridge
  • Draw a diagram to depict RMSE values for different hyperparameters and show the lowest RMSE at the optimal values of the L1 and L2 parameters.

Models Comparison

  • Build the following models and find the best-performing model, i.e. the one with the lowest RMSE value.
  • Ridge
  • Lasso
  • Elastic Net (With Lasso and Ridge regularizations)
  • Decision Tree
  • Random Forest

Load the dataset

In [1]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
In [2]:
housing_df = sqlContext.read.format("com.databricks.spark.csv")       \
        .options(delimiter=',', header = True, inferSchema = True)  \
        .load('file:///home/hadoop/lab/data/kc_house_data.csv')
In [3]:
housing_df.show(4)
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+
|        id|           date|   price|bedrooms|bathrooms|sqft_living|sqft_lot|floors|waterfront|view|condition|grade|sqft_above|sqft_basement|yr_built|yr_renovated|zipcode|    lat|    long|sqft_living15|sqft_lot15|
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+
|7129300520|20141013T000000|221900.0|       3|      1.0|       1180|    5650|   1.0|         0|   0|        3|    7|      1180|            0|    1955|           0|  98178|47.5112|-122.257|         1340|      5650|
|6414100192|20141209T000000|538000.0|       3|     2.25|       2570|    7242|   2.0|         0|   0|        3|    7|      2170|          400|    1951|        1991|  98125| 47.721|-122.319|         1690|      7639|
|5631500400|20150225T000000|180000.0|       2|      1.0|        770|   10000|   1.0|         0|   0|        3|    6|       770|            0|    1933|           0|  98028|47.7379|-122.233|         2720|      8062|
|2487200875|20141209T000000|604000.0|       4|      3.0|       1960|    5000|   1.0|         0|   0|        5|    7|      1050|          910|    1965|           0|  98136|47.5208|-122.393|         1360|      5000|
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+
only showing top 4 rows

Let's cache the dataset

In [4]:
housing_df.cache()
Out[4]:
DataFrame[id: bigint, date: string, price: double, bedrooms: int, bathrooms: double, sqft_living: int, sqft_lot: int, floors: double, waterfront: int, view: int, condition: int, grade: int, sqft_above: int, sqft_basement: int, yr_built: int, yr_renovated: int, zipcode: int, lat: double, long: double, sqft_living15: int, sqft_lot15: int]
In [5]:
housing_df.printSchema()
root
|-- id: long (nullable = true)
|-- date: string (nullable = true)
|-- price: double (nullable = true)
|-- bedrooms: integer (nullable = true)
|-- bathrooms: double (nullable = true)
|-- sqft_living: integer (nullable = true)
|-- sqft_lot: integer (nullable = true)
|-- floors: double (nullable = true)
|-- waterfront: integer (nullable = true)
|-- view: integer (nullable = true)
|-- condition: integer (nullable = true)
|-- grade: integer (nullable = true)
|-- sqft_above: integer (nullable = true)
|-- sqft_basement: integer (nullable = true)
|-- yr_built: integer (nullable = true)
|-- yr_renovated: integer (nullable = true)
|-- zipcode: integer (nullable = true)
|-- lat: double (nullable = true)
|-- long: double (nullable = true)
|-- sqft_living15: integer (nullable = true)
|-- sqft_lot15: integer (nullable = true)

How many records are in the dataset?

In [6]:
housing_df.count()
Out[6]:
21613

Exploring the data

Summary Statistics of price and sqft_living

In [7]:
housing_df.describe("price", "sqft_living").show()
+-------+-----------------+------------------+
|summary|            price|       sqft_living|
+-------+-----------------+------------------+
|  count|            21613|             21613|
|   mean|540088.1417665294|2079.8997362698374|
| stddev|367127.1964827003| 918.4408970468108|
|    min|          75000.0|               290|
|    max|        7700000.0|             13540|
+-------+-----------------+------------------+

  • We can also explicitly ask for specific statistics about columns by passing the appropriate aggregate functions.
In [8]:
from pyspark.sql.functions import mean, min, max
housing_df.select([max('price'), mean('price'), min('price')] ).show()
+----------+-----------------+----------+
|max(price)|       avg(price)|min(price)|
+----------+-----------------+----------+
| 7700000.0|540088.1417665294|   75000.0|
+----------+-----------------+----------+

In [9]:
housing_df.select([mean('price'), mean('sqft_living')] ).show()
+-----------------+------------------+
|       avg(price)|  avg(sqft_living)|
+-----------------+------------------+
|540088.1417665294|2079.8997362698374|
+-----------------+------------------+

Is there any relationship between waterfront and view?

In [10]:
waterfront_view_df = housing_df.select(['waterfront', 'view'] )
In [11]:
waterfront_view_df.show(10)
+----------+----+
|waterfront|view|
+----------+----+
|         0|   0|
|         0|   0|
|         0|   0|
|         0|   0|
|         0|   0|
|         0|   0|
|         0|   0|
|         0|   0|
|         0|   0|
|         0|   0|
+----------+----+
only showing top 10 rows

Using crosstab function

In [12]:
waterfront_view_df.stat.crosstab("view", "waterfront").show()
+---------------+-----+---+
|view_waterfront|    0|  1|
+---------------+-----+---+
|              0|19489|  0|
|              1|  331|  1|
|              2|  955|  8|
|              3|  491| 19|
|              4|  184|135|
+---------------+-----+---+

Mean and standard deviation of prices for houses by different conditions

In [13]:
from pyspark.sql.functions import avg, stddev

price_condition_avg_df = housing_df.groupBy('condition').agg(avg('price'))
In [14]:
price_condition_avg_df.show()
+---------+-----------------+
|condition|       avg(price)|
+---------+-----------------+
|        1|334431.6666666667|
|        2|327287.1453488372|
|        3|542012.5781483857|
|        4|521200.3900334566|
|        5|612418.0893592004|
+---------+-----------------+

In [15]:
housing_df.groupBy('condition').agg(stddev('price')).show()
+---------+----------------------+
|condition|stddev_samp(price,0,0)|
+---------+----------------------+
|        1|     271172.8048373091|
|        2|    245418.41321957213|
|        3|     364449.0623431626|
|        4|     358516.2313502135|
|        5|    410971.92253990826|
+---------+----------------------+

Plotting distribution of price

In [16]:
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
:0: FutureWarning: IPython widgets are experimental and may change in the future.
In [17]:
housing_price_pd = housing_df.select('price').toPandas()

Density Plot

In [18]:
sn.set(rc={"figure.figsize": (10, 6)})
sn.distplot(housing_price_pd['price'], norm_hist=True)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f926404bb38>

Box plot

In [19]:
sn.boxplot(x= housing_price_pd['price'])
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9263efeeb8>

Note:

  • We can observe that there are many outliers in the dataset as far as price is concerned; a rough way to quantify them is sketched below.
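
One possible way to quantify (and list) the outliers seen in the box plot is the usual 1.5 × IQR rule. This is only a sketch, not a cell from this notebook; it reuses the housing_price_pd pandas dataframe created above.

# Sketch: quantify the price outliers using the 1.5 * IQR rule.
q1, q3 = housing_price_pd['price'].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

outliers = housing_price_pd[housing_price_pd['price'] > upper_fence]
print(len(outliers), "sale prices lie above the upper fence of", upper_fence)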

Correlation between price and sqft_living

In [20]:
column_labels = ['price','sqft_living', 'sqft_lot', 'bedrooms','bathrooms', \
         'floors', 'sqft_above', 'sqft_basement','yr_built','yr_renovated', \
        'sqft_living15', 'sqft_lot15']
In [21]:
housing_df.stat.corr( 'price', 'sqft_living' )
Out[21]:
0.7020350546118008

Note:

  • sqft_living is highly correlated with price

Which factors are highly correlated with price?

In [22]:
import numpy as np
from pyspark.mllib.stat import Statistics
In [23]:
column_corr = Statistics.corr(housing_df.rdd.map(lambda x:
                         np.array([x['price'],
                                   x['sqft_living'],
                                   x['sqft_lot'],
                                   x['bedrooms'],
                                   x['bathrooms'],
                                   x['floors'],
                                   x['sqft_above'],
                                   x['sqft_basement'],
                                   x['yr_built'],
                                   x['yr_renovated'],
                                   x['sqft_living15'],
                                   x['sqft_lot15']
                                  ])), method='pearson')
In [24]:
sn.heatmap( column_corr, vmin=0,
          vmax=1,
          annot= True,
          xticklabels = column_labels,
          yticklabels = column_labels )
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9263fae860>

Note:

  • The factors sqft_living, bathrooms, sqft_living15 and sqft_above appear to be the most strongly correlated with price and are likely to be good predictors; a quick programmatic ranking is sketched below.
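
As a sketch (reusing the column_corr matrix and column_labels list built above), the features can be ranked by the absolute value of their correlation with price instead of reading the heatmap by eye:

# Sketch: rank features by |correlation with price|, largest first.
price_corr = column_corr[0]               # first row = correlations with price
order = np.argsort(-np.abs(price_corr))

for i in order[1:]:                       # skip index 0, which is price itself
    print("{0:15s} {1:+.3f}".format(column_labels[i], price_corr[i]))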

Which zipcodes have the highest average prices?

In [25]:
from pyspark.sql.functions import desc

price_by_zipcodes_df = housing_df.groupBy('zipcode').agg(avg('price')).sort( desc( 'avg(price)') )
In [26]:
price_by_zipcodes_df.toPandas()[0:10]
Out[26]:
zipcode avg(price)
0 98039 2160606.600000
1 98004 1355927.082019
2 98040 1194230.021277
3 98112 1095499.342007
4 98102 901258.266667
5 98109 879623.623853
6 98105 862825.231441
7 98006 859684.779116
8 98119 849448.016304
9 98005 810164.875000

Plot the top 10 zipcodes by average price

In [27]:
top_10_zipcodes = price_by_zipcodes_df.toPandas()[0:10]
In [28]:
sn.barplot( data = top_10_zipcodes,
          x='zipcode',
          y='avg(price)',
          order = top_10_zipcodes.zipcode)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9263bf8780>
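
The exploratory task asks for the median rather than the mean price per zipcode. Since the dataset is small (about 21k rows), one simple alternative is to compute the median locally in pandas and plot it the same way; this is an illustrative sketch, not a cell that was run above.

# Sketch: rank zipcodes by *median* sale price and plot the top 10.
zip_price_pd = housing_df.select('zipcode', 'price').toPandas()
median_by_zipcode = (zip_price_pd.groupby('zipcode')['price']
                     .median()
                     .sort_values(ascending=False))

top_10_median = median_by_zipcode[0:10].reset_index()
sn.barplot(data=top_10_median,
           x='zipcode',
           y='price',
           order=top_10_median.zipcode)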

Any null values?

In [29]:
from pyspark.sql.functions import isnull

house_df_clean = housing_df.na.drop( how = 'any' )
In [30]:
house_df_clean.count() == housing_df.count()
Out[30]:
True

Note:

  • Both row counts are the same, hence there are no null values in any of the columns. A per-column check is sketched below.
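
For completeness, an explicit per-column null count (a sketch using the isnull function imported above) would also tell us which column needed imputation if any nulls were present; every count should be 0 for this dataset.

# Sketch: explicit per-column null counts.
from pyspark.sql.functions import count, isnull, when

housing_df.select(
    [count(when(isnull(c), c)).alias(c) for c in housing_df.columns]).show()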

Feature Engineering

  • Creating or transforming features is a critical step in building models. For example, if a variable is skewed, it can be log-transformed to make its distribution more normal. We can also derive new features from existing ones that help explain or predict the response variable.

Log transformation for price variable

  • The price variable is right-skewed. We can apply a log transformation to make its distribution closer to normal.
In [31]:
sn.set(rc={"figure.figsize": (10, 6)})
sn.distplot(housing_price_pd['price'], norm_hist=True)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9261d1e0f0>
In [32]:
from pyspark.sql.functions import col, log
housing_df = housing_df.withColumn( 'log_price', log('price') )
In [33]:
sn.set(rc={"figure.figsize": (10, 6)})
sn.distplot(housing_df.select('log_price').toPandas(), norm_hist=True)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9263bc1ac8>

Log transformation for the sqft_lot variable

In [34]:
sn.set(rc={"figure.figsize": (8, 4)})
sn.distplot(housing_df.select('sqft_lot').toPandas(), norm_hist=True)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9261997588>
In [35]:
housing_df = housing_df.withColumn( 'log_sqft_lot', log('sqft_lot') )
sn.distplot(housing_df.select('log_sqft_lot').toPandas(), norm_hist=True)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f926172d160>

Correlation between the log-transformed price and sqft_lot

In [36]:
housing_df.stat.corr( 'price', 'sqft_lot' )
Out[36]:
0.0896608605871001
In [37]:
housing_df.stat.corr( 'log_price', 'log_sqft_lot' )
Out[37]:
0.13772713692113375

Calculating age of the house

  • The age of a house may influence its price.
  • Assuming 2015 as the base year, the age of each house can be computed and used as a predictor of the house price.
In [38]:
from pyspark.sql.functions import lit

housing_df = housing_df.withColumn("age", lit(2015) - col('yr_built'))
In [39]:
housing_df.show( 2 )
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+
|        id|           date|   price|bedrooms|bathrooms|sqft_living|sqft_lot|floors|waterfront|view|condition|grade|sqft_above|sqft_basement|yr_built|yr_renovated|zipcode|    lat|    long|sqft_living15|sqft_lot15|         log_price|     log_sqft_lot|age|
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+
|7129300520|20141013T000000|221900.0|       3|      1.0|       1180|    5650|   1.0|         0|   0|        3|    7|      1180|            0|    1955|           0|  98178|47.5112|-122.257|         1340|      5650|12.309982108920686|8.639410824140487| 60|
|6414100192|20141209T000000|538000.0|       3|     2.25|       2570|    7242|   2.0|         0|   0|        3|    7|      2170|          400|    1951|        1991|  98125| 47.721|-122.319|         1690|      7639|13.195613839143922|8.887652690325586| 64|
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+
only showing top 2 rows

In [40]:
housing_df.columns
Out[40]:
['id',
'date',
'price',
'bedrooms',
'bathrooms',
'sqft_living',
'sqft_lot',
'floors',
'waterfront',
'view',
'condition',
'grade',
'sqft_above',
'sqft_basement',
'yr_built',
'yr_renovated',
'zipcode',
'lat',
'long',
'sqft_living15',
'sqft_lot15',
'log_price',
'log_sqft_lot',
'age']

When was the house last renovated?

In [41]:
housing_df = housing_df.withColumn("rennovate_age", lit(2015) - col('yr_renovated'))
In [42]:
housing_df.show(2)
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+-------------+
|        id|           date|   price|bedrooms|bathrooms|sqft_living|sqft_lot|floors|waterfront|view|condition|grade|sqft_above|sqft_basement|yr_built|yr_renovated|zipcode|    lat|    long|sqft_living15|sqft_lot15|         log_price|     log_sqft_lot|age|rennovate_age|
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+-------------+
|7129300520|20141013T000000|221900.0|       3|      1.0|       1180|    5650|   1.0|         0|   0|        3|    7|      1180|            0|    1955|           0|  98178|47.5112|-122.257|         1340|      5650|12.309982108920686|8.639410824140487| 60|         2015|
|6414100192|20141209T000000|538000.0|       3|     2.25|       2570|    7242|   2.0|         0|   0|        3|    7|      2170|          400|    1951|        1991|  98125| 47.721|-122.319|         1690|      7639|13.195613839143922|8.887652690325586| 64|           24|
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+-------------+
only showing top 2 rows
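
Note that yr_renovated is 0 for houses that have never been renovated, so rennovate_age becomes 2015 for those rows (as in the first row above). One possible refinement, not applied in this notebook, is to fall back to the build year so the feature reads as "years since the last major work":

# Sketch (not applied here): use the build year when yr_renovated is 0.
from pyspark.sql.functions import when, col, lit

housing_alt_df = housing_df.withColumn(
    "rennovate_age",
    when(col('yr_renovated') == 0, lit(2015) - col('yr_built'))
    .otherwise(lit(2015) - col('yr_renovated')))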

Keep a copy of the original dataframe for later use

In [43]:
housing_original_df = housing_df

Columns that will be used as features and their types

In [44]:
continuous_features = ['sqft_living', 'bedrooms', 'bathrooms', 'floors',
                    'log_sqft_lot', 'age', 'sqft_above',
                    'sqft_living15', 'sqft_lot15', 'rennovate_age']

categorical_features = ['zipcode', 'waterfront',
                      'grade', 'condition',
                      'view']

Define a function to create categorical features

In [45]:
def create_category_vars( dataset, field_name ):
    idx_col = field_name + "Index"
    col_vec = field_name + "Vec"

    # Convert the categorical values into numeric indexes
    string_indexer = StringIndexer( inputCol=field_name,
                                    outputCol=idx_col )

    indexer_model = string_indexer.fit( dataset )
    indexed_df = indexer_model.transform( dataset )

    # One-hot encode the indexed values into a sparse vector column
    encoder = OneHotEncoder( dropLast=True,
                             inputCol=idx_col,
                             outputCol=col_vec )

    return encoder.transform( indexed_df )

Encoding all categorical features

In [46]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, PolynomialExpansion, VectorIndexer

for cat_col in categorical_features:
  housing_df = create_category_vars( housing_df, cat_col )

housing_df.cache()
Out[46]:
DataFrame[id: bigint, date: string, price: double, bedrooms: int, bathrooms: double, sqft_living: int, sqft_lot: int, floors: double, waterfront: int, view: int, condition: int, grade: int, sqft_above: int, sqft_basement: int, yr_built: int, yr_renovated: int, zipcode: int, lat: double, long: double, sqft_living15: int, sqft_lot15: int, log_price: double, log_sqft_lot: double, age: int, rennovate_age: int, zipcodeIndex: double, zipcodeVec: vector, waterfrontIndex: double, waterfrontVec: vector, gradeIndex: double, gradeVec: vector, conditionIndex: double, conditionVec: vector, viewIndex: double, viewVec: vector]
In [47]:
housing_df.show( 4 )
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+-------------+------------+---------------+---------------+-------------+----------+--------------+--------------+-------------+---------+-------------+
|        id|           date|   price|bedrooms|bathrooms|sqft_living|sqft_lot|floors|waterfront|view|condition|grade|sqft_above|sqft_basement|yr_built|yr_renovated|zipcode|    lat|    long|sqft_living15|sqft_lot15|         log_price|     log_sqft_lot|age|rennovate_age|zipcodeIndex|     zipcodeVec|waterfrontIndex|waterfrontVec|gradeIndex|      gradeVec|conditionIndex| conditionVec|viewIndex|      viewVec|
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+-------------+------------+---------------+---------------+-------------+----------+--------------+--------------+-------------+---------+-------------+
|7129300520|20141013T000000|221900.0|       3|      1.0|       1180|    5650|   1.0|         0|   0|        3|    7|      1180|            0|    1955|           0|  98178|47.5112|-122.257|         1340|      5650|12.309982108920686|8.639410824140487| 60|         2015|        45.0|(69,[45],[1.0])|            0.0|(1,[0],[1.0])|       0.0|(11,[0],[1.0])|           0.0|(4,[0],[1.0])|      0.0|(4,[0],[1.0])|
|6414100192|20141209T000000|538000.0|       3|     2.25|       2570|    7242|   2.0|         0|   0|        3|    7|      2170|          400|    1951|        1991|  98125| 47.721|-122.319|         1690|      7639|13.195613839143922|8.887652690325586| 64|           24|        17.0|(69,[17],[1.0])|            0.0|(1,[0],[1.0])|       0.0|(11,[0],[1.0])|           0.0|(4,[0],[1.0])|      0.0|(4,[0],[1.0])|
|5631500400|20150225T000000|180000.0|       2|      1.0|        770|   10000|   1.0|         0|   0|        3|    6|       770|            0|    1933|           0|  98028|47.7379|-122.233|         2720|      8062|12.100712129872347|9.210340371976184| 82|         2015|        34.0|(69,[34],[1.0])|            0.0|(1,[0],[1.0])|       3.0|(11,[3],[1.0])|           0.0|(4,[0],[1.0])|      0.0|(4,[0],[1.0])|
|2487200875|20141209T000000|604000.0|       4|      3.0|       1960|    5000|   1.0|         0|   0|        5|    7|      1050|          910|    1965|           0|  98136|47.5208|-122.393|         1360|      5000|13.311329476916953|8.517193191416238| 50|         2015|        44.0|(69,[44],[1.0])|            0.0|(1,[0],[1.0])|       0.0|(11,[0],[1.0])|           2.0|(4,[2],[1.0])|      0.0|(4,[0],[1.0])|
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+-------------+------------+---------------+---------------+-------------+----------+--------------+--------------+-------------+---------+-------------+
only showing top 4 rows

Create a feature vector from all the feature columns

In [48]:
featureCols = continuous_features + ['zipcodeVec',
                                   'waterfrontVec',
                                   'gradeVec',
                                   'conditionVec',
                                   'viewVec']
In [49]:
featureCols
Out[49]:
['sqft_living',
'bedrooms',
'bathrooms',
'floors',
'log_sqft_lot',
'age',
'sqft_above',
'sqft_living15',
'sqft_lot15',
'rennovate_age',
'zipcodeVec',
'waterfrontVec',
'gradeVec',
'conditionVec',
'viewVec']

Preparing for model building

  • The dataframe needs to have two columns: features and label
  • The assembled feature vector column needs to be named features
  • The target variable needs to be named label
  • The dataframe can then be fed directly to a model for training
In [50]:
assembler = VectorAssembler( inputCols = featureCols, outputCol = "features")
In [51]:
assembler.outputCol
Out[51]:
Param(parent='VectorAssembler_4662a220beddbb6d3b19', name='outputCol', doc='output column name.')
In [52]:
housing_train_df = assembler.transform( housing_df )
In [53]:
housing_train_df.show( 5 )
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+-------------+------------+---------------+---------------+-------------+----------+--------------+--------------+-------------+---------+-------------+--------------------+
|        id|           date|   price|bedrooms|bathrooms|sqft_living|sqft_lot|floors|waterfront|view|condition|grade|sqft_above|sqft_basement|yr_built|yr_renovated|zipcode|    lat|    long|sqft_living15|sqft_lot15|         log_price|     log_sqft_lot|age|rennovate_age|zipcodeIndex|     zipcodeVec|waterfrontIndex|waterfrontVec|gradeIndex|      gradeVec|conditionIndex| conditionVec|viewIndex|      viewVec|            features|
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+-------------+------------+---------------+---------------+-------------+----------+--------------+--------------+-------------+---------+-------------+--------------------+
|7129300520|20141013T000000|221900.0|       3|      1.0|       1180|    5650|   1.0|         0|   0|        3|    7|      1180|            0|    1955|           0|  98178|47.5112|-122.257|         1340|      5650|12.309982108920686|8.639410824140487| 60|         2015|        45.0|(69,[45],[1.0])|            0.0|(1,[0],[1.0])|       0.0|(11,[0],[1.0])|           0.0|(4,[0],[1.0])|      0.0|(4,[0],[1.0])|(99,[0,1,2,3,4,5,...|
|6414100192|20141209T000000|538000.0|       3|     2.25|       2570|    7242|   2.0|         0|   0|        3|    7|      2170|          400|    1951|        1991|  98125| 47.721|-122.319|         1690|      7639|13.195613839143922|8.887652690325586| 64|           24|        17.0|(69,[17],[1.0])|            0.0|(1,[0],[1.0])|       0.0|(11,[0],[1.0])|           0.0|(4,[0],[1.0])|      0.0|(4,[0],[1.0])|(99,[0,1,2,3,4,5,...|
|5631500400|20150225T000000|180000.0|       2|      1.0|        770|   10000|   1.0|         0|   0|        3|    6|       770|            0|    1933|           0|  98028|47.7379|-122.233|         2720|      8062|12.100712129872347|9.210340371976184| 82|         2015|        34.0|(69,[34],[1.0])|            0.0|(1,[0],[1.0])|       3.0|(11,[3],[1.0])|           0.0|(4,[0],[1.0])|      0.0|(4,[0],[1.0])|(99,[0,1,2,3,4,5,...|
|2487200875|20141209T000000|604000.0|       4|      3.0|       1960|    5000|   1.0|         0|   0|        5|    7|      1050|          910|    1965|           0|  98136|47.5208|-122.393|         1360|      5000|13.311329476916953|8.517193191416238| 50|         2015|        44.0|(69,[44],[1.0])|            0.0|(1,[0],[1.0])|       0.0|(11,[0],[1.0])|           2.0|(4,[2],[1.0])|      0.0|(4,[0],[1.0])|(99,[0,1,2,3,4,5,...|
|1954400510|20150218T000000|510000.0|       3|      2.0|       1680|    8080|   1.0|         0|   0|        3|    8|      1680|            0|    1987|           0|  98074|47.6168|-122.045|         1800|      7503|13.142166004700508|8.997147151515142| 28|         2015|        14.0|(69,[14],[1.0])|            0.0|(1,[0],[1.0])|       1.0|(11,[1],[1.0])|           0.0|(4,[0],[1.0])|      0.0|(4,[0],[1.0])|(99,[0,1,2,3,4,5,...|
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+-------------+------------+---------------+---------------+-------------+----------+--------------+--------------+-------------+---------+-------------+--------------------+
only showing top 5 rows

In [54]:
from pyspark.sql.functions import round

housing_train_df = housing_train_df.withColumn( "label", round('log_price', 4) )
In [55]:
housing_train_df.show( 5 )
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+-------------+------------+---------------+---------------+-------------+----------+--------------+--------------+-------------+---------+-------------+--------------------+-------+
|        id|           date|   price|bedrooms|bathrooms|sqft_living|sqft_lot|floors|waterfront|view|condition|grade|sqft_above|sqft_basement|yr_built|yr_renovated|zipcode|    lat|    long|sqft_living15|sqft_lot15|         log_price|     log_sqft_lot|age|rennovate_age|zipcodeIndex|     zipcodeVec|waterfrontIndex|waterfrontVec|gradeIndex|      gradeVec|conditionIndex| conditionVec|viewIndex|      viewVec|            features|  label|
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+-------------+------------+---------------+---------------+-------------+----------+--------------+--------------+-------------+---------+-------------+--------------------+-------+
|7129300520|20141013T000000|221900.0|       3|      1.0|       1180|    5650|   1.0|         0|   0|        3|    7|      1180|            0|    1955|           0|  98178|47.5112|-122.257|         1340|      5650|12.309982108920686|8.639410824140487| 60|         2015|        45.0|(69,[45],[1.0])|            0.0|(1,[0],[1.0])|       0.0|(11,[0],[1.0])|           0.0|(4,[0],[1.0])|      0.0|(4,[0],[1.0])|(99,[0,1,2,3,4,5,...|  12.31|
|6414100192|20141209T000000|538000.0|       3|     2.25|       2570|    7242|   2.0|         0|   0|        3|    7|      2170|          400|    1951|        1991|  98125| 47.721|-122.319|         1690|      7639|13.195613839143922|8.887652690325586| 64|           24|        17.0|(69,[17],[1.0])|            0.0|(1,[0],[1.0])|       0.0|(11,[0],[1.0])|           0.0|(4,[0],[1.0])|      0.0|(4,[0],[1.0])|(99,[0,1,2,3,4,5,...|13.1956|
|5631500400|20150225T000000|180000.0|       2|      1.0|        770|   10000|   1.0|         0|   0|        3|    6|       770|            0|    1933|           0|  98028|47.7379|-122.233|         2720|      8062|12.100712129872347|9.210340371976184| 82|         2015|        34.0|(69,[34],[1.0])|            0.0|(1,[0],[1.0])|       3.0|(11,[3],[1.0])|           0.0|(4,[0],[1.0])|      0.0|(4,[0],[1.0])|(99,[0,1,2,3,4,5,...|12.1007|
|2487200875|20141209T000000|604000.0|       4|      3.0|       1960|    5000|   1.0|         0|   0|        5|    7|      1050|          910|    1965|           0|  98136|47.5208|-122.393|         1360|      5000|13.311329476916953|8.517193191416238| 50|         2015|        44.0|(69,[44],[1.0])|            0.0|(1,[0],[1.0])|       0.0|(11,[0],[1.0])|           2.0|(4,[2],[1.0])|      0.0|(4,[0],[1.0])|(99,[0,1,2,3,4,5,...|13.3113|
|1954400510|20150218T000000|510000.0|       3|      2.0|       1680|    8080|   1.0|         0|   0|        3|    8|      1680|            0|    1987|           0|  98074|47.6168|-122.045|         1800|      7503|13.142166004700508|8.997147151515142| 28|         2015|        14.0|(69,[14],[1.0])|            0.0|(1,[0],[1.0])|       1.0|(11,[1],[1.0])|           0.0|(4,[0],[1.0])|      0.0|(4,[0],[1.0])|(99,[0,1,2,3,4,5,...|13.1422|
+----------+---------------+--------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+-----------------+---+-------------+------------+---------------+---------------+-------------+----------+--------------+--------------+-------------+---------+-------------+--------------------+-------+
only showing top 5 rows

Split the dataset

In [56]:
seed = 42
In [57]:
train_df, test_df = housing_train_df.randomSplit( [0.7, 0.3], seed = seed )

Build the Linear Regression Model

In [58]:
from pyspark.ml.regression import LinearRegression
In [59]:
linreg = LinearRegression(maxIter=500, regParam=0.0)
In [60]:
lm = linreg.fit( train_df )
In [61]:
lm.intercept
Out[61]:
12.659035366516674
In [62]:
lm.coefficients
Out[62]:
DenseVector([0.0001, -0.0023, 0.0421, -0.0141, 0.0622, 0.0002, 0.0001, 0.0001, -0.0, -0.0, -0.3851, -1.0734, -0.3922, -0.6054, -0.4005, -1.1875, -0.7028, -0.7609, -1.2773, -0.6087, -0.7684, -0.903, -1.0792, -0.8163, -0.6971, -0.4488, -0.7221, -0.6538, -0.9138, -0.6722, -1.2421, -0.6999, -0.6748, -1.2226, -0.5291, -0.868, -0.4616, -0.6288, -0.3544, -0.1371, -0.8413, -0.3905, -0.9461, -0.5911, -0.8259, -0.3848, -1.2209, -1.1924, -1.1682, -0.7845, -0.1781, -1.1465, -1.0939, -0.3606, -0.5432, -1.0678, -1.1831, -0.6539, -0.929, -1.2115, -0.2709, -0.9345, -1.2525, -0.8444, -0.7937, -0.9205, -0.8572, -0.2357, -0.5211, -0.5934, -1.1213, -1.2766, -0.919, -0.9496, -0.2051, -0.2519, -1.0166, -0.8366, -1.0436, -0.4222, 0.3667, 0.4686, 0.567, 0.2477, 0.6132, 0.6345, 0.1117, 0.6644, 0.1023, 0.6471, 0.6004, 0.272, 0.3155, 0.386, 0.1174, -0.2621, -0.1566, -0.0927, -0.1567])

Make predictions on test data and evaluate

In [63]:
y_pred = lm.transform( test_df )
/home/hadoop/lab/software/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/regression.py:123: UserWarning: weights is deprecated. Use coefficients instead.
warnings.warn("weights is deprecated. Use coefficients instead.")
In [64]:
y_pred.select( 'features', 'label', 'prediction' ).show( 5 )
+--------------------+-------+------------------+
|            features|  label|        prediction|
+--------------------+-------+------------------+
|(99,[0,1,2,3,4,5,...|13.1956|13.187317690065836|
|(99,[0,1,2,3,4,5,...|12.1007|12.574503618483911|
|(99,[0,1,2,3,4,5,...|13.3113|13.183232371094126|
|(99,[0,1,2,3,4,5,...|12.8992|12.898223910069529|
|(99,[0,1,2,3,4,5,...|12.8866| 12.78774646368769|
+--------------------+-------+------------------+
only showing top 5 rows

Calculate the actual predicted price by inverting the log transformation

In [65]:
from pyspark.sql.functions import exp

y_pred = y_pred.withColumn( "y_pred", exp( 'prediction' ) )

Calculate RMSE

In [66]:
from pyspark.ml.evaluation import RegressionEvaluator
In [67]:
rmse_evaluator = RegressionEvaluator(labelCol="price",
                              predictionCol="y_pred",
                              metricName="rmse" )
In [68]:
lm_rmse = rmse_evaluator.evaluate( y_pred )
In [69]:
lm_rmse
Out[69]:
132953.0815948827

Calculate R-squared

In [70]:
r2_evaluator = RegressionEvaluator(labelCol="price",
                              predictionCol="y_pred",
                              metricName="r2" )
In [71]:
lm_r2 = r2_evaluator.evaluate( y_pred )
In [72]:
lm_r2
Out[72]:
0.8619140817878358

A utility function to get the evaluation metrics: R-squared and RMSE values

In [73]:
def get_r2_rmse( model, test_df ):
  y_pred = model.transform( test_df )
  y_pred = y_pred.withColumn( "y_pred", exp( 'prediction' ) )
  rmse_evaluator = RegressionEvaluator(labelCol="price",
                              predictionCol="y_pred",
                              metricName="rmse" )
  r2_evaluator = RegressionEvaluator(labelCol="price",
                              predictionCol="y_pred",
                              metricName="r2" )

  return [np.round( r2_evaluator.evaluate( y_pred ), 2), np.round( rmse_evaluator.evaluate( y_pred ), 2 )]
In [74]:
perf_params = get_r2_rmse( lm, test_df )
In [75]:
perf_params
Out[75]:
[0.85999999999999999, 132953.07999999999]

Create a dataframe to store all model performances

In [76]:
import pandas as pd

model_perf = pd.DataFrame( columns = ['name', 'rsquared', 'rmse'] )
In [77]:
model_perf = model_perf.append( pd.Series( ["Linear Regression"] + perf_params ,
                 index = model_perf.columns ),
                 ignore_index = True )
In [78]:
model_perf
Out[78]:
name rsquared rmse
0 Linear Regression 0.86 132953.08

Grid Search for Optimal Regularization Parameter

In [79]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
In [80]:
lrModel = LinearRegression(maxIter=50)

LinearRegression takes two regularization parameters

regParam=0.0, elasticNetParam=0.0

If you are interested in controlling the L1 and L2 penalties separately, keep in mind that this is equivalent to:

a * L1 + b * L2

where:

regParam = a + b and elasticNetParam = a / (a + b)

When elasticNetParam is 1, b = 0. That means it is a pure L1 penalty, i.e. a Lasso penalty.

When elasticNetParam is 0, a = 0. That means it is a pure L2 penalty, i.e. a Ridge penalty.
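
For example (an illustrative calculation, not a cell from this notebook): to apply an L1 penalty of a = 0.02 and an L2 penalty of b = 0.08, the mapping above gives regParam = 0.1 and elasticNetParam = 0.2.

# Illustrative only: choose regParam and elasticNetParam for given L1/L2 weights.
a, b = 0.02, 0.08                     # desired L1 and L2 penalty weights
reg_param = a + b                     # 0.1
elastic_net_param = a / (a + b)       # 0.2

enet_lr = LinearRegression(maxIter=50,
                           regParam=reg_param,
                           elasticNetParam=elastic_net_param)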

Using Ridge Regression

  • regParam acts as an L2 (ridge) penalty when elasticNetParam is 0
In [81]:
paramGrid = ParamGridBuilder()                          \
  .addGrid(lrModel.regParam, [0.1, 0.01, 0.001])      \
  .addGrid(lrModel.elasticNetParam, [0.0])            \
  .build()
In [82]:
evaluator = RegressionEvaluator(
  metricName="r2",
  labelCol="label",
)
In [83]:
crossval = CrossValidator(estimator=lrModel,
                        estimatorParamMaps=paramGrid,
                        evaluator=evaluator,
                        numFolds=2)  # use 3+ folds in practice
In [84]:
cvModel = crossval.fit( train_df )
/home/hadoop/lab/software/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/regression.py:123: UserWarning: weights is deprecated. Use coefficients instead.
warnings.warn("weights is deprecated. Use coefficients instead.")

Find the best parameters

In [85]:
cvModel.bestModel._java_obj.getRegParam()
Out[85]:
0.001
In [86]:
cvModel.bestModel._java_obj.getElasticNetParam()
Out[86]:
0.0
In [87]:
ridge_perf = get_r2_rmse( cvModel.bestModel, test_df )
/home/hadoop/lab/software/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/regression.py:123: UserWarning: weights is deprecated. Use coefficients instead.
warnings.warn("weights is deprecated. Use coefficients instead.")
In [88]:
model_perf = model_perf.append( pd.Series( ["Ridge Regression"] + ridge_perf ,
                 index = model_perf.columns ),
                 ignore_index = True )

model_perf
Out[88]:
name rsquared rmse
0 Linear Regression 0.86 132953.08
1 Ridge Regression 0.86 133851.30

Using Lasso Regression

  • regParam acts as an L1 (lasso) penalty when elasticNetParam is 1.0
In [89]:
paramGrid = ParamGridBuilder()                          \
  .addGrid(lrModel.regParam, [0.1, 0.01, 0.001])      \
  .addGrid(lrModel.elasticNetParam, [1.0])            \
  .build()

evaluator = RegressionEvaluator(
  metricName="r2",
  labelCol="label",
)

crossval = CrossValidator(estimator=lrModel,
                        estimatorParamMaps=paramGrid,
                        evaluator=evaluator,
                        numFolds=2)  # use 3+ folds in practice
In [90]:
cvModel = crossval.fit( train_df )
cvModel.bestModel._java_obj.getElasticNetParam()
/home/hadoop/lab/software/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/regression.py:123: UserWarning: weights is deprecated. Use coefficients instead.
warnings.warn("weights is deprecated. Use coefficients instead.")
Out[90]:
1.0
In [91]:
cvModel.bestModel._java_obj.getRegParam()
Out[91]:
0.001
In [92]:
lasso_perf = get_r2_rmse( cvModel.bestModel, test_df )
/home/hadoop/lab/software/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/regression.py:123: UserWarning: weights is deprecated. Use coefficients instead.
warnings.warn("weights is deprecated. Use coefficients instead.")
In [93]:
model_perf = model_perf.append( pd.Series( ["Lasso Regression"] + lasso_perf ,
                 index = model_perf.columns ),
                 ignore_index = True )

model_perf
Out[93]:
name rsquared rmse
0 Linear Regression 0.86 132953.08
1 Ridge Regression 0.86 133851.30
2 Lasso Regression 0.84 141895.41
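
The Evaluation section asks for a diagram of RMSE against different hyperparameter values. One simple sketch (not run above) is to fit one model per regParam value for the pure L2 (ridge) and pure L1 (lasso) settings, evaluate each on the test set with get_r2_rmse, and plot the curves; the lowest point of each curve marks the best regParam.

# Sketch: test-set RMSE over a range of regParam values for ridge and lasso.
reg_params = [0.001, 0.01, 0.1, 0.5]
penalties = {'ridge (L2)': 0.0, 'lasso (L1)': 1.0}

for name, enet_param in penalties.items():
    rmses = []
    for reg in reg_params:
        model = LinearRegression(maxIter=50, regParam=reg,
                                 elasticNetParam=enet_param).fit(train_df)
        r2, rmse = get_r2_rmse(model, test_df)
        rmses.append(rmse)
    plt.plot(reg_params, rmses, marker='o', label=name)

plt.xscale('log')
plt.xlabel('regParam')
plt.ylabel('RMSE on test data')
plt.legend()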

Function to create the best model using a parameter grid and a cross-validation strategy

In [94]:
def getBestModel( paramGrid, lModel, train, test ):

  evaluator = RegressionEvaluator(
      metricName="r2",
      labelCol="label",
  )

  crossval = CrossValidator(estimator=lModel,
                        estimatorParamMaps=paramGrid,
                        evaluator=evaluator,
                        numFolds=2)  # use 3+ folds in practice

  cvModel = crossval.fit( train )
  r2, rmse = get_r2_rmse( cvModel.bestModel, test )

  print( "RMSE: ", np.round( rmse, 2 ) )
  print( "R-Squared: ", np.round( r2, 2 ) )

  return cvModel, rmse, r2

Elastic Net Regression

In [95]:
lModel = LinearRegression(maxIter=50)

enetParamGrid = ParamGridBuilder()                     \
  .addGrid(lModel.regParam, [0.1, 0.01])             \
  .addGrid(lModel.elasticNetParam, [0.2, 0.5])       \
  .build()

train_df, test_df = housing_train_df.randomSplit( [0.7, 0.3], seed = seed )
In [96]:
enetModel, rmse, r2 = getBestModel( enetParamGrid,
                                 lModel,
                                 train_df,
                                 test_df )
RMSE:  142570.89
R-Squared:  0.84
/home/hadoop/lab/software/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/regression.py:123: UserWarning: weights is deprecated. Use coefficients instead.
warnings.warn("weights is deprecated. Use coefficients instead.")
In [97]:
enet_perf = get_r2_rmse( enetModel.bestModel, test_df )
In [98]:
model_perf = model_perf.append( pd.Series( ["ElasticNet Regression"] + enet_perf ,
                 index = model_perf.columns ),
                 ignore_index = True )

model_perf
Out[98]:
name rsquared rmse
0 Linear Regression 0.86 132953.08
1 Ridge Regression 0.86 133851.30
2 Lasso Regression 0.84 141895.41
3 ElasticNet Regression 0.84 142570.89

Decision Tree Regressor

  • Using a decision tree regressor with maxDepth = 6
In [99]:
from pyspark.ml.regression import DecisionTreeRegressor, RandomForestRegressor

treeModel = DecisionTreeRegressor(featuresCol="features",
                              labelCol="label",
                              maxDepth=6)
In [100]:
tlm = treeModel.fit( train_df )
In [101]:
tree_perf = get_r2_rmse( tlm, test_df )
In [102]:
model_perf = model_perf.append( pd.Series( ["Decision Tree"] + tree_perf ,
                 index = model_perf.columns ),
                 ignore_index = True )

model_perf
Out[102]:
name rsquared rmse
0 Linear Regression 0.86 132953.08
1 Ridge Regression 0.86 133851.30
2 Lasso Regression 0.84 141895.41
3 ElasticNet Regression 0.84 142570.89
4 Decision Tree 0.66 209512.43

Random Forest Regressor

  • Use a Random Forest regressor to build 100 decision trees, each considering one third of the available features, with a maximum depth of 10 per tree.
  • Supported options for featureSubsetStrategy: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]
  • Details of the API for RandomForestRegressor are available here
In [103]:
rfModel = RandomForestRegressor(featuresCol="features",
                              labelCol="label",
                              numTrees=100,
                              maxDepth=10,
                              featureSubsetStrategy='onethird')
In [104]:
rflm = rfModel.fit( train_df )
In [105]:
rf_perf = get_r2_rmse( rflm, test_df )
In [106]:
model_perf = model_perf.append( pd.Series( ["Random Forest Tree"] + rf_perf ,
                 index = model_perf.columns ),
                 ignore_index = True )

model_perf
Out[106]:
name rsquared rmse
0 Linear Regression 0.86 132953.08
1 Ridge Regression 0.86 133851.30
2 Lasso Regression 0.84 141895.41
3 ElasticNet Regression 0.84 142570.89
4 Decision Tree 0.66 209512.43
5 Random Forest Tree 0.74 180826.21

Gradient Boosted Trees

  • API Details are available here
In [107]:
from pyspark.ml.regression import GBTRegressor

gbtModel = GBTRegressor(featuresCol="features",
                 labelCol="label",
                 maxIter=20,
                 maxDepth=6,
                 maxBins = 10)

gblm = gbtModel.fit( train_df )
In [108]:
gbt_perf = get_r2_rmse( gblm, test_df )
In [109]:
model_perf = model_perf.append( pd.Series( ["Gradient Boosted Tree"] + gbt_perf ,
                 index = model_perf.columns ),
                 ignore_index = True )

model_perf
Out[109]:
name rsquared rmse
0 Linear Regression 0.86 132953.08
1 Ridge Regression 0.86 133851.30
2 Lasso Regression 0.84 141895.41
3 ElasticNet Regression 0.84 142570.89
4 Decision Tree 0.66 209512.43
5 Random Forest Tree 0.74 180826.21
6 Gradient Boosted Tree 0.74 183886.19

Conclusion:

The best-performing model is Linear Regression, with an R-squared value of 0.86 and the lowest RMSE.

Creating Pipeline

MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow.

  • More explanation is available here
In [110]:
continuous_features
Out[110]:
['sqft_living',
'bedrooms',
'bathrooms',
'floors',
'log_sqft_lot',
'age',
'sqft_above',
'sqft_living15',
'sqft_lot15',
'rennovate_age']
In [111]:
categorical_features
Out[111]:
['zipcode', 'waterfront', 'grade', 'condition', 'view']
In [112]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

## Create indexers for the categorical features
indexers = [
    StringIndexer(inputCol=c, outputCol="{}_idx".format(c))
    for c in categorical_features]

## encode the categorical features
encoders = [
  OneHotEncoder(
      inputCol=idx.getOutputCol(),
      outputCol="{0}_enc".format(idx.getOutputCol())) for idx in indexers]

## Create vectors for all features categorical and continuous

assembler = VectorAssembler(
  inputCols=[enc.getOutputCol() for enc in encoders] + continuous_features,
  outputCol="features")

## Initialize the linear model
lrModel = LinearRegression( maxIter = 10 )


## Create the pipeline with sequence of activities
pipeline = Pipeline(
  stages=indexers + encoders + [assembler, lrModel ])
In [113]:
housing_pipeline_df = housing_original_df.withColumn( 'label', round( log( 'price' ), 4) )
In [114]:
training, testing = housing_pipeline_df.randomSplit( [0.7, 0.3], seed = seed )
In [115]:
training.show(5)
+----------+---------------+---------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+------------------+---+-------------+-------+
|        id|           date|    price|bedrooms|bathrooms|sqft_living|sqft_lot|floors|waterfront|view|condition|grade|sqft_above|sqft_basement|yr_built|yr_renovated|zipcode|    lat|    long|sqft_living15|sqft_lot15|         log_price|      log_sqft_lot|age|rennovate_age|  label|
+----------+---------------+---------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+------------------+---+-------------+-------+
|7129300520|20141013T000000| 221900.0|       3|      1.0|       1180|    5650|   1.0|         0|   0|        3|    7|      1180|            0|    1955|           0|  98178|47.5112|-122.257|         1340|      5650|12.309982108920686| 8.639410824140487| 60|         2015|  12.31|
|1954400510|20150218T000000| 510000.0|       3|      2.0|       1680|    8080|   1.0|         0|   0|        3|    8|      1680|            0|    1987|           0|  98074|47.6168|-122.045|         1800|      7503|13.142166004700508| 8.997147151515142| 28|         2015|13.1422|
|7237550310|20140512T000000|1225000.0|       4|      4.5|       5420|  101930|   1.0|         0|   0|        3|   11|      3890|         1530|    2001|           0|  98053|47.6561|-122.005|         4760|    101930|14.018451401960965|11.532041582162458| 14|         2015|14.0185|
|1321400060|20140627T000000| 257500.0|       3|     2.25|       1715|    6819|   2.0|         0|   0|        3|    7|      1715|            0|    1995|           0|  98003|47.3097|-122.327|         2238|      6819|12.458774999085929| 8.827468112520654| 20|         2015|12.4588|
|2008000270|20150115T000000| 291850.0|       3|      1.5|       1060|    9711|   1.0|         0|   0|        3|    7|      1060|            0|    1963|           0|  98198|47.4095|-122.315|         1650|      9711|12.583995250631936| 9.181014542594355| 52|         2015| 12.584|
+----------+---------------+---------+--------+---------+-----------+--------+------+----------+----+---------+-----+----------+-------------+--------+------------+-------+-------+--------+-------------+----------+------------------+------------------+---+-------------+-------+
only showing top 5 rows

In [116]:
model = pipeline.fit( training )
In [117]:
get_r2_rmse( model, testing )
/home/hadoop/lab/software/spark-1.6.0-bin-hadoop2.6/python/pyspark/ml/regression.py:123: UserWarning: weights is deprecated. Use coefficients instead.
warnings.warn("weights is deprecated. Use coefficients instead.")
Out[117]:
[0.85999999999999999, 132953.07999999999]