Want the top rating for your Airbnb listings? Use these words!

J Song
6 min read · Apr 19, 2021
Photo by Michael Browning on Unsplash

Introduction

It is easy to come across marketing articles and blog posts claiming substantial booking increases if you follow their formula for writing listing titles and descriptions.

This post analyzes publicly available Airbnb listing data for the Boston area to compare the words used in top-rated listings with those used in poorly rated listings.

Questions we are going to answer

1. What are the ten most frequently used words in the titles and descriptions of all Airbnb listings?

2. What are the ten most frequently used words in the titles and descriptions of top-rated Airbnb listings?

3. What are the ten most frequently used words in the titles and descriptions of poorly rated Airbnb listings?

Data Exploration

There are three data sets for this project, available at Kaggle.com/airbnb/boston. We read them into three DataFrames: calendar, listings, and reviews.
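The loading step can be sketched as below. To keep the snippet self-contained and runnable, inline sample text stands in for listings.csv; in practice you would call pd.read_csv('listings.csv') (and likewise for calendar.csv and reviews.csv), and the column names shown are the ones this post relies on.

```python
import io
import pandas as pd

# Stand-in for listings.csv so this snippet runs on its own;
# the real file has 95 columns. In practice:
#   df_listing = pd.read_csv('listings.csv')
csv_text = """name,description,review_scores_rating,number_of_reviews
Cozy room in Boston,Private room near the T,100,12
Sunny South End apt,Walk to everything,95,8
"""
df_listing = pd.read_csv(io.StringIO(csv_text))
print(df_listing.shape)  # (2, 4)
print(list(df_listing.columns))
```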

  • calendar contains availability and price information about Airbnb listings.
calendar DataFrame
  • listings contains the title, description, and various other information about Airbnb listings, spread across 95 columns. It is the primary data set because it includes most of the information for this project, including name (the listing’s title), description (the listing’s description), review_scores_rating (the listing’s rating from reviews), etc.
listings DataFrame
  • reviews contains customer review information about Airbnb listings.
reviews DataFrame

Data Analysis

First, we will find the ten most frequently used words in the title and description of all listings.

We will read in all words from the title and description of every listing and remove the stopwords, which carry little meaning on their own.

import re
import nltk
from tqdm import tqdm

# Collect preprocessed text from all listings
preprocessed_description = []
preprocessed_name = []
stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
'won', "won't", 'wouldn', "wouldn't"])
# Remove NaN values first
ser_description = df_listing['description'].dropna()
ser_name = df_listing['name'].dropna()
for sentence in tqdm(ser_description.values):
    # Keep letters only, lowercase everything, and drop stopwords
    sentence = re.sub(r'[^a-zA-Z]', ' ', sentence)
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
    preprocessed_description.append(sentence.strip())

Then we can use the FreqDist class in the NLTK library to compute the frequency of each word.

# Join the preprocessed descriptions into a single string, then tokenize
allDescriptionText = ' '.join(preprocessed_description)
allWords = nltk.tokenize.word_tokenize(allDescriptionText)
allWordDist = nltk.FreqDist(allWords)
allWordDist

Now we can sort the words by frequency and print the top ten words. Here are the top ten words used in the description and the title for all listings.

#: description
sortedWord = sorted(allWordDist.items(), key=lambda x: x[1], reverse=True)
sortedWord[:10]
[('boston', 5323),
('room', 3670),
('apartment', 3282),
('kitchen', 2768),
('walk', 2667),
('bedroom', 2481),
('bed', 2196),
('living', 1957),
('located', 1825),
('street', 1791)]
#: title
sortedWord4Name = sorted(allWordDist4Name.items(), key=lambda x: x[1], reverse=True)
sortedWord4Name[:10]
[('boston', 715),
('room', 629),
('bedroom', 362),
('private', 354),
('end', 354),
('apt', 329),
('apartment', 310),
('near', 270),
('cozy', 261),
('studio', 248)]

We can plot the list to visualize the result.
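One way to plot it is a simple bar chart of the counts printed above; this is a sketch using matplotlib with a non-interactive backend (FreqDist also offers a built-in .plot() method as an alternative).

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; chart is saved to a file
import matplotlib.pyplot as plt

# Top-ten description words and counts from the output above
words = ['boston', 'room', 'apartment', 'kitchen', 'walk',
         'bedroom', 'bed', 'living', 'located', 'street']
counts = [5323, 3670, 3282, 2768, 2667, 2481, 2196, 1957, 1825, 1791]

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(words, counts)
ax.set_ylabel('frequency')
ax.set_title('Ten most frequent words in listing descriptions')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
fig.savefig('top_words.png')
```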

Next, we will find out the most frequently used ten words in the title and description of the top-rated listings.

Top-rated means the listings with the highest possible review score, 100. We read in titles and descriptions from listings with review_scores_rating == 100 and more than one review. The number of listings included is 3516.

ser_description_top = df_listing[(df_listing['review_scores_rating']==100) & (df_listing['number_of_reviews']>1) ]['description'].dropna()
ser_description_top

Then we can follow the same procedure above to remove those stop words and rank words by frequency.

Last, we will find out the most frequently used ten words in the title and description of the poorly rated listings.

Poorly rated indicates the listings with a review score below 60 out of 100. We read in titles and descriptions from listings with review_scores_rating < 60 and more than one review. The number of listings included in the analysis is 3484, which is close to the number of top-rated listings above.

ser_description_bot = df_listing[(df_listing['review_scores_rating']<60) & (df_listing['number_of_reviews']>1) ]['description'].dropna()
ser_description_bot

Again, we can follow the same procedure above to remove those stop words and rank words by frequency.
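Since the same preprocess-and-count procedure is repeated for each group, it can be wrapped in a small helper. The sketch below is my own (the function name, and the use of collections.Counter in place of NLTK's FreqDist, are not from the original code), but the counting behaves the same way:

```python
import re
from collections import Counter

def top_words(texts, stopwords, n=10):
    """Return the n most frequent non-stopword words across texts."""
    counts = Counter()
    for text in texts:
        # Keep letters only, lowercase, drop stopwords — same steps as above
        text = re.sub(r'[^a-zA-Z]', ' ', text)
        counts.update(w.lower() for w in text.split() if w.lower() not in stopwords)
    return counts.most_common(n)

sample = ["Cozy room in Boston", "Sunny Boston apartment near the T"]
print(top_words(sample, {'the', 'in', 'near'}, n=3))
```

Each filtered description Series (all listings, top-rated, poorly rated) can then be passed through the same function.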

Results

These are the ten most frequently used words in the title and description of all listings in the Boston area.

Title ['boston',
'room',
'bedroom',
'private',
'end',
'apt',
'apartment',
'near',
'cozy',
'studio']

Description
['boston',
'room',
'apartment',
'kitchen',
'walk',
'bedroom',
'bed',
'living',
'located',
'street']

These are the ten most frequently used words in the title and description of the top-rated listings.

Title
['boston',
'room',
'end',
'south',
'private',
'near',
'sunny',
'cozy',
'location',
'w']
Description
['boston',
'room',
'bedroom',
'apartment',
'kitchen',
'private',
'bed',
'living',
'walk',
'home']

These are the ten most frequently used words in the title and description of the poorly rated listings.

Title
['umass',
'mgh',
'longwood',
'city',
'bcec',
'apartment',
'cozy',
'train',
'private',
'room']
Description
['mins',
'room',
'train',
'station',
'apartment',
'kitchen',
'boston',
'minutes',
'free',
'access']

Conclusion

After examining the ten most frequent words used in the title and description of all listings, the top-rated listings, and the poorly rated listings, I made the following observations:

  1. The titles of all listings and the top-rated listings share six of their ten words.
  2. The descriptions of all listings and the top-rated listings share eight of their ten words.
  3. The titles of all listings and the poorly rated listings share four of their ten words.
  4. The descriptions of all listings and the poorly rated listings share four of their ten words.
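The overlap counts can be checked directly from the top-ten lists in the Results section with a quick set intersection:

```python
# Top-ten lists copied from the Results section above
all_title  = ['boston', 'room', 'bedroom', 'private', 'end',
              'apt', 'apartment', 'near', 'cozy', 'studio']
top_title  = ['boston', 'room', 'end', 'south', 'private',
              'near', 'sunny', 'cozy', 'location', 'w']
poor_title = ['umass', 'mgh', 'longwood', 'city', 'bcec',
              'apartment', 'cozy', 'train', 'private', 'room']

all_desc  = ['boston', 'room', 'apartment', 'kitchen', 'walk',
             'bedroom', 'bed', 'living', 'located', 'street']
top_desc  = ['boston', 'room', 'bedroom', 'apartment', 'kitchen',
             'private', 'bed', 'living', 'walk', 'home']
poor_desc = ['mins', 'room', 'train', 'station', 'apartment',
             'kitchen', 'boston', 'minutes', 'free', 'access']

print(len(set(all_title) & set(top_title)))   # 6
print(len(set(all_desc)  & set(top_desc)))    # 8
print(len(set(all_title) & set(poor_title)))  # 4
print(len(set(all_desc)  & set(poor_desc)))   # 4
```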

In general, I see more specific, unique words in the poorly rated listings. For example, they use words naming particular places, such as MGH and Longwood, and terms that specify travel times, such as mins.

According to the analysis above, when you write a listing, don’t sweat over packing it with highly specific details; instead, use the words most commonly used across all listings.

I can only speculate about why the listings with very specific terms received poor ratings. Assuming all the property conditions are relatively similar (I know this is a BIG assumption), unique, highly specific terms in a listing may set expectations for potential guests that differ from what the owner originally intended.
