Airbnb Price Recommender

Cleaning

Cleaning done to the original data

First of all, we had to select the features to consider for our model. We considered all the features that could be relevant for a flat that is about to be put on the market, i.e.:

zipcode
latitude
longitude
neighbourhood_cleansed
host_about
host_verifications
host_listings_count

host_identity_verified
availability_30
availability_60
availability_90
availability_365
cancellation_policy
property_type

room_type
bathrooms
bedrooms
beds
bed_type
amenities

accommodates
guests_included
extra_people
cleaning_fee
security_deposit
price

Format conversions

We had to convert some of the features to a numeric format. Most importantly, the price feature:

						# Strip the currency symbol and thousands separator, then cast to float
						data['price'] = data['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)

We decided to turn the feature "security_deposit" from a numerical value into a binary flag (0 or 1). This way the application only needs to ask users whether or not they want a security deposit, without requiring them to specify the amount. The same treatment was applied to the feature "host_about", which is the host's description of themselves: we retain only whether the host has written a description or not.
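The binarization described above might look like the following minimal sketch. The toy values and the raw string format of "security_deposit" (a dollar sign and thousands separators, as in "price") are assumptions, as is the choice to treat a missing deposit or an empty description as 0.

```python
import pandas as pd

# Toy data; the raw column formats are assumptions, not taken from the dataset.
data = pd.DataFrame({
    "security_deposit": ["$200.00", None, "$0.00", "$1,000.00"],
    "host_about": ["I love hosting.", None, "", "Architect from Berlin."],
})

# Strip currency formatting, treat missing values as "no deposit",
# then flag every amount greater than zero.
deposit = (
    data["security_deposit"]
    .fillna("0")
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
data["security_deposit"] = (deposit > 0).astype(int)

# 1 if the host wrote a non-empty description, 0 otherwise.
data["host_about"] = data["host_about"].fillna("").str.strip().ne("").astype(int)
```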

Also, the "cancellation_policy" options, i.e., "strict", "moderate" and "flexible", were mapped to numbers.

						for feature in data.columns: # Loop through all columns in the dataframe
							if data[feature].dtype == 'object': # Only apply for columns with categorical strings
								data[feature] = pd.Categorical(data[feature]).codes # Replace strings with an integer
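The loop above assigns codes in sorted order of the category labels, and pd.Categorical gives missing values the code -1. As an alternative sketch, an explicit dictionary makes the numeric order a deliberate choice. The mapping below (ordered from least to most restrictive) and the toy rows are assumptions for illustration only.

```python
import pandas as pd

# Toy rows using the three policy names mentioned in the text.
data = pd.DataFrame(
    {"cancellation_policy": ["strict", "flexible", "moderate", "flexible"]}
)

# Hypothetical mapping, ordered from least to most restrictive.
policy_codes = {"flexible": 0, "moderate": 1, "strict": 2}

# Values not present in the dictionary would become NaN, which makes
# unexpected categories easy to spot.
data["cancellation_policy"] = data["cancellation_policy"].map(policy_codes)
```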

Feature split and encoding

One-hot encoding was performed on the features "neighbourhood_cleansed", "property_type", "room_type" and "bed_type". In this way we obtained 73, 3, 16, and 5 new features, respectively.

						one_hot = pd.get_dummies(data['neighbourhood_cleansed'])
						data = data.drop('neighbourhood_cleansed', axis=1)
						data = data.join(one_hot)
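The same drop-and-join pattern can be applied to all four features in a single call by passing the columns argument to pd.get_dummies. This is only a sketch on made-up rows; note that the new columns are prefixed with the original feature name, which differs from the snippet above.

```python
import pandas as pd

# Made-up rows; the category values are illustrative only.
data = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "B", "A"],
    "property_type": ["Apartment", "House", "Apartment"],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
    "bed_type": ["Real Bed", "Real Bed", "Futon"],
    "price": [80.0, 45.0, 60.0],
})

categorical = ["neighbourhood_cleansed", "property_type", "room_type", "bed_type"]

# get_dummies drops the listed columns and appends one indicator column
# per category, named "<feature>_<category>".
encoded = pd.get_dummies(data, columns=categorical, dtype=int)
```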

In order to perform the one-hot encoding on the "amenities" (wifi, TV, parking, etc.) and "host_verifications" (Facebook, LinkedIn, etc.) features, we first had to split each of these columns into one column per amenity or host verification option.

						def getColumnInfoTV(data):
							aux = cleanAmenities(data)
							if 'TV' in aux:
								return 1
							return 0

						data.loc[:,("TV")] = data.amenities.apply(getColumnInfoTV)
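The snippet above handles a single amenity. A sketch that generalizes it to every amenity found in the data is shown below. The helper parse_amenities is a hypothetical stand-in for cleanAmenities, which is not shown here, and the raw format (a brace-delimited, comma-separated string) is an assumption; amenity names that themselves contain commas would need a more careful parser.

```python
import pandas as pd

# Toy rows; the brace-delimited format is an assumption about the raw data.
data = pd.DataFrame({
    "amenities": ['{TV,Wifi,"Free parking"}', '{Wifi}', '{}'],
})

def parse_amenities(raw):
    """Hypothetical stand-in for cleanAmenities: return a set of amenity names."""
    items = raw.strip("{}").split(",")
    return {item.strip().strip('"') for item in items if item.strip()}

parsed = data["amenities"].apply(parse_amenities)

# Collect every amenity seen in any listing, then add one 0/1 column per amenity.
all_amenities = sorted(set().union(*parsed))
for name in all_amenities:
    data[name] = parsed.apply(lambda items, name=name: int(name in items))
```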

Finding the “real” price

Finally, while working with this dataset we realized that a high cleaning fee was often associated with a lower property price. This could be a "strategy" by some hosts who want to "attract" potential guests with a lower price and then make up the difference with a high cleaning fee. In order not to be biased by this possibility, we decided to predict the full price, including the cleaning fee. Thus we sum the features "price" and "cleaning_fee" to get the final price feature for our model.
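As a minimal sketch of this step, assuming both columns have already been converted to floats and that a missing cleaning fee should count as zero (an assumption, since the handling of missing values is not described here):

```python
import pandas as pd

# Toy values; both columns are assumed to be numeric already.
data = pd.DataFrame({
    "price": [80.0, 45.0, 120.0],
    "cleaning_fee": [20.0, None, 35.0],
})

# Treat a missing cleaning fee as 0 so the sum does not turn into NaN.
data["price"] = data["price"] + data["cleaning_fee"].fillna(0)
```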
