Cleaning

First, we selected the features for our model. We kept every feature that could be relevant for a flat that is about to be put on the market, i.e.:


  • zipcode
  • latitude
  • longitude
  • neighbourhood_cleansed
  • host_about
  • host_verifications
  • host_listings_count
  • host_identity_verified
  • availability_30
  • availability_60
  • availability_90
  • availability_365
  • cancellation_policy
  • property_type
  • room_type
  • bathrooms
  • bedrooms
  • beds
  • bed_type
  • amenities
  • accommodates
  • guests_included
  • extra_people
  • cleaning_fee
  • security_deposit
  • price

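We had to convert some of the features to a numeric format. Most importantly, the price feature, which is stored as a string with a dollar sign and thousands separators:

						# Strip the currency symbol and the thousands separators, then cast to float
						data['price'] = data['price'].map(lambda x: x.replace('$', '').replace(',', '')).astype(float)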
					

We decided to turn the feature "security_deposit" from a numerical amount into a binary flag (0 or 1). This lets us use the feature by asking the users of our application only whether they want a security deposit, without having to specify the amount. The same treatment was applied to the feature "host_about", the host description: we retain only whether the host has uploaded some information about themselves.
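
The binarization described above can be sketched as follows. This is a minimal illustration rather than our exact code: the toy DataFrame is made up, "security_deposit" is assumed to be numeric already, and treating a missing or zero deposit as "no deposit" is an assumption of the example.

```python
import pandas as pd

# Toy rows standing in for the listings table (values are made up for illustration)
data = pd.DataFrame({
    "security_deposit": [200.0, 0.0, None, 50.0],
    "host_about": ["I love hosting!", None, "", "Local guide"],
})

# 1 if a positive deposit is requested, 0 otherwise
# (assumption: a missing or zero amount means no deposit)
data["security_deposit"] = (data["security_deposit"].fillna(0) > 0).astype(int)

# 1 if the host wrote a non-empty description, 0 otherwise
data["host_about"] = data["host_about"].fillna("").str.strip().ne("").astype(int)
```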

Also, the "cancellation_policy" options, i.e., "strict", "moderate" and "flexible", were mapped to integers.

						for feature in data.columns: # Loop through all columns in the dataframe
							if data[feature].dtype == 'object': # Only apply for columns with categorical strings
								data[feature] = pd.Categorical(data[feature]).codes # Replace strings with an integer
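
Note that pd.Categorical assigns its codes following the sorted order of the values, so the three policies end up as flexible = 0, moderate = 1 and strict = 2. If the ordering matters, it can also be stated explicitly. The sketch below compares both approaches on toy data; the policy_order dictionary and the variable names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy column with the three policy values mentioned in the text
data = pd.DataFrame({"cancellation_policy": ["strict", "flexible", "moderate", "strict"]})

# Hypothetical explicit mapping, from most lenient to most restrictive
policy_order = {"flexible": 0, "moderate": 1, "strict": 2}
explicit_codes = data["cancellation_policy"].map(policy_order)

# Codes produced by pd.Categorical, as in the generic loop, for comparison
categorical_codes = pd.Categorical(data["cancellation_policy"]).codes
```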
					

One-hot encoding was performed on the features "neighbourhood_cleansed", "property_type", "room_type" and "bed_type". In this way we obtained 73, 3, 16, and 5 new features respectively.

						one_hot = pd.get_dummies(data['neighbourhood_cleansed'])
						data = data.drop('neighbourhood_cleansed', axis=1)
						data = data.join(one_hot)
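
The same drop-and-join pattern can be applied to several columns in a single call. The following is a sketch on toy data, not our exact code: the category values are illustrative, and only two of the four encoded columns are shown to keep it short.

```python
import pandas as pd

# Toy rows standing in for the listings table (category values are illustrative)
data = pd.DataFrame({
    "neighbourhood_cleansed": ["North", "South", "North"],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [40.0, 120.0, 25.0],
})

categorical = ["neighbourhood_cleansed", "room_type"]

# get_dummies removes the listed columns and appends one indicator column
# per distinct value, prefixed with the original column name
data = pd.get_dummies(data, columns=categorical, prefix=categorical)
```

Prefixing the new columns with the original feature name avoids clashes when two features share a category label.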
					

In order to one-hot encode the "amenities" (wifi, TV, parking, etc.) and "host_verifications" (facebook, linkedin, etc.) features, we first had to split each of them into one binary column per amenity or host verification option.

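						def getColumnInfoTV(amenities):
							# Return 1 if the listing's amenities include a TV, 0 otherwise
							aux = cleanAmenities(amenities)  # helper defined elsewhere
							if 'TV' in aux:
								return 1
							return 0

						# Add a binary "TV" column computed from each row's amenities
						data['TV'] = data.amenities.apply(getColumnInfoTV)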
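
Writing one function per amenity becomes repetitive, so the split can also be done for all options at once. The sketch below is an assumption-laden illustration, not our exact code: it assumes the raw "amenities" entry looks like '{TV,"Cable TV",Wifi}', and parse_amenities is a hypothetical helper that would break on amenity names containing a comma.

```python
import pandas as pd

def parse_amenities(raw):
    """Hypothetical parser: turn '{TV,"Cable TV",Wifi}' into ['TV', 'Cable TV', 'Wifi']."""
    items = raw.strip("{}").split(",")
    return [item.strip().strip('"') for item in items if item.strip()]

# Toy rows; the raw string format is an assumption about the dataset
data = pd.DataFrame({"amenities": ['{TV,"Cable TV",Wifi}', "{Wifi,Kitchen}", "{}"]})

# One list of amenity names per listing
parsed = data["amenities"].apply(parse_amenities)

# Join each list with '|' and let pandas build one 0/1 column per distinct amenity
indicators = parsed.str.join("|").str.get_dummies(sep="|")
data = data.join(indicators)
```

The same approach would apply to "host_verifications", with a parser adapted to that column's format.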
					

Finally, while working with this dataset we realized that a high cleaning fee was often associated with a lower property price. This could be a "strategy" by hosts who want to attract potential clients with a lower price and then earn more through a high cleaning fee. To avoid being biased by this, we decided to predict the full price, including the cleaning fee. Thus we sum the features "price" and "cleaning_fee" to get the final price feature for our model.
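
A minimal sketch of this last step, on made-up numbers. It assumes both columns have already been converted to floats, and treating a missing cleaning fee as zero is an assumption of the example.

```python
import pandas as pd

# Toy rows; both columns are assumed to be numeric already
data = pd.DataFrame({
    "price": [80.0, 45.0, 120.0],
    "cleaning_fee": [20.0, None, 35.0],
})

# Full price = nightly price + cleaning fee
# (assumption: a missing cleaning fee means no fee is charged)
data["price"] = data["price"] + data["cleaning_fee"].fillna(0)
```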