Critically reflect on the compliance of recommender system


Reference no: EM132881167

Question 1

In this question you investigate how well the nightly price can be predicted from the other variables in the dataframe. You need to decide yourself which variables to choose, but make sure you have at least 10 variables. The only requirement is that you use room_type as a predictor. Because room_type is a categorical variable, you first have to use dummy coding to turn it into a number of binary variables (hint: pd.get_dummies()). In the notebook, provide a short explanation for your choice of variables.
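As an illustration of the dummy-coding hint, here is a minimal sketch on a toy dataframe. The values are made up, and dropping the first category (so the dummies are not collinear with the regression intercept) is an assumption of this sketch, not a requirement stated in the brief.

```python
import pandas as pd

# Toy stand-in for the listings dataframe (values are invented for illustration).
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "accommodates": [4, 2, 1, 2],
})

# One binary column per category; drop_first=True removes the first category
# so the remaining dummies are not perfectly collinear with an intercept.
dummies = pd.get_dummies(listings["room_type"], prefix="room_type",
                         drop_first=True, dtype=int)

# Replace the categorical column with its dummy columns.
encoded = pd.concat([listings.drop(columns="room_type"), dummies], axis=1)
print(encoded.columns.tolist())
```

The new column names can then be used in the predictors list in place of room_type.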

Starting from the variables you have chosen, our goal is to derive a sparse model with fewer variables. This process is called variable selection. In variable selection ('variable' means the same as 'predictor'), variables get iteratively added to or removed from the regression model. Once finished, the model typically contains only a subset of the original variables. This makes the model easier to interpret, and in some cases it makes it generalise better to new data. To perform variable selection, implement a function variable_selection(df, predictors, target, alpha)

where df is the listings dataframe, predictors is a list with your initial selection of at least 10 variables, for example ['neighbourhood_cleansed', 'property_type', 'room_type', 'accommodates', 'bedrooms', 'beds', 'minimum_nights', 'maximum_nights', 'availability_365', 'number_of_reviews', 'review_scores_location'],
target is the target variable for the regression (e.g. 'price'), and alpha is the significance level for selecting significant predictors (e.g. 0.05). The function returns pred, the selected subset of the original predictors.

To calculate regression fits and p-values you can use statsmodels. Your approach operates in two stages: In stage 1, you build a model by adding variables one after the other. You keep adding variables that increase the adjusted R2 coefficient. You do not need to calculate it by hand; the statsmodels package provides it. In stage 2, starting from these variables, if any of them are not significant, you keep removing variables until all variables in the model are significant. The output of the second stage is your final set of variables. Let us look at the two stages in detail:

Stage 1 (add variables)
• Start with an empty set of variables.
• Fit multiple one-variable regression models. In each iteration, use one of the variables provided in predictors. The variable that leads to the largest increase in adjusted R2 is added to the model.
• Now proceed by adding a second variable into the model. Starting from the remaining variables, again choose the variable that leads to the largest increase in adjusted R2.
• Continue in the same way for the third, fourth, ... variable.
• You are finished when there is no variable left that increases adjusted R2.

Stage 2 (remove non-significant variables)
It is possible that some of the variables from the previous stage are not significant. We call a variable "significant" if the p-value of its coefficient is smaller than or equal to the given threshold alpha.
• Start by fitting a model using the variables that have been added to the model in Stage 1.
• If there is a variable that is not significant, remove the variable with the largest p-value and fit the model again with the reduced set of variables.
• Keep removing variables and re-fitting the model until all remaining variables are significant.
• The remaining significant variables are the output of your function.

To solve this question, provide corresponding code in Question 2b in the notebook and provide a short answer in the space following YOUR ANSWER. To test your function, add a function call with your selection of predictors and alpha level.

Question 2

There have been requests from customers to provide automated recommendations. First, guests requested a recommender system that helps them identify neighbourhoods in a city that fit their budget. Second, hosts who want to offer new listings would like a recommender system that suggests a nightly price. As a response to these requests, you and your team have worked out specifications for two recommender systems.

Recommender system 1: Recommend a neighbourhood given a budget
Guests who are traveling on a budget have been requesting a tool that allows them to quickly see which neighbourhood in a city offers the most accommodation opportunities within their budget bracket. You want to implement a Python function that delivers this functionality. The plan is to integrate the Python function into the Airbnb website. After some deliberation, you work out that the function should meet these specifications:
• The user should be able to specify their budget bracket, that is, a minimum and maximum budget. For instance, a user might look for properties priced in the $10-$50 range. Another user looking for luxury accommodation may opt for a $100-$500 range.
• Your function identifies which neighbourhood has the largest number of properties within this range. It returns a string representing the name of the neighbourhood. Use the neighbourhood_cleansed variable for the names of the neighbourhoods.
• Some neighbourhoods have more properties than others, so by considering only absolute numbers, neighbourhoods with more properties could always be preferred by your algorithm. An alternative is to consider relative numbers, that is, the proportion of listings within the budget bracket relative to the total number of listings within a given neighbourhood. The user should be able to select whether they want absolute or relative numbers.

From these specifications, you arrive at the following function signature:
recommend_neighbourhood(df, budget_min, budget_max, relative)
where df is your listings dataframe, and budget_min and budget_max are floating point numbers representing the budget bracket. The numbers are inclusive, i.e., a nightly price exactly equal to budget_min or budget_max is considered, too. The variable relative is a Boolean specifying whether relative numbers (fractions) should be considered in the recommendation. If False, absolute numbers are considered instead.
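A minimal sketch of this specification is shown below. It assumes the price column has already been converted to numbers (in raw listings data it may be stored as text), and the default value relative=False and the toy dataframe are assumptions made for illustration.

```python
import pandas as pd


def recommend_neighbourhood(df, budget_min, budget_max, relative=False):
    # Boolean mask of listings inside the bracket; between() is inclusive on both ends.
    in_budget = df["price"].between(budget_min, budget_max)
    per_neighbourhood = in_budget.groupby(df["neighbourhood_cleansed"])
    # mean() of a Boolean mask is the fraction in budget, sum() is the count.
    scores = per_neighbourhood.mean() if relative else per_neighbourhood.sum()
    return str(scores.idxmax())


# Toy data: A has 2 of 4 listings in the $10-$50 bracket, B has 1 of 1.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "A", "B"],
    "price": [20.0, 30.0, 200.0, 300.0, 40.0],
})
print(recommend_neighbourhood(toy, 10, 50, relative=False))
print(recommend_neighbourhood(toy, 10, 50, relative=True))
```

If several neighbourhoods tie, idxmax returns the first one in sorted order; a production version would need an explicit tie-breaking rule.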
Recommender system 2: Price recommender for hosts
If a new host wants to offer their room / flat / house on Airbnb, they need to decide what the nightly price will be. There is no official guidance, but hosts have been asking Airbnb to provide an algorithm. After some deliberation, you work out that the function should meet these specifications:
• The user has to provide the geolocation (latitude and longitude) of their property.
• Your algorithm searches for the geographically closest properties (simply measured by Euclidean distance in terms of latitude/longitude) that are already listed on Airbnb. Your price recommendation will be the mean of the nightly prices of these nearby properties.
• The user should be able to set the number of nearby properties (called neighbours) that are considered. A larger number of neighbours indicates a larger geographical area.
• You can ignore the fact that some neighbours could be in a different neighbourhood.
• Nightly prices are quite different for different room types. The user should be able to set the desired room type. If the room type is defined, only properties of the respective room type are taken into consideration.

From these specifications, you arrive at the following function signature:
recommend_price(df, latitude, longitude, n_neighbours, room_type)
where latitude and longitude represent the geolocation of the property, and n_neighbours is the number of neighbouring properties the user wants to take into account. room_type, if specified, restricts the neighbour search to properties of the given room type; it should default to None, which means that any room type is considered.
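The price recommender can be sketched along these lines. It assumes numeric latitude, longitude and price columns and a unique dataframe index; the toy data is invented for illustration.

```python
import numpy as np
import pandas as pd


def recommend_price(df, latitude, longitude, n_neighbours, room_type=None):
    # Optionally restrict the search to listings of the requested room type.
    candidates = df if room_type is None else df[df["room_type"] == room_type]
    # Euclidean distance in latitude/longitude space, as the specification asks.
    distance = np.hypot(candidates["latitude"] - latitude,
                        candidates["longitude"] - longitude)
    nearest = distance.nsmallest(n_neighbours).index
    return float(candidates.loc[nearest, "price"].mean())


toy = pd.DataFrame({
    "latitude": [0.0, 0.0, 1.0, 10.0],
    "longitude": [0.0, 1.0, 0.0, 10.0],
    "price": [100.0, 200.0, 300.0, 1000.0],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Private room"],
})
print(recommend_price(toy, 0.0, 0.2, n_neighbours=2))
print(recommend_price(toy, 0.0, 0.2, n_neighbours=2, room_type="Private room"))
```

If fewer than n_neighbours listings match the room type, this sketch silently averages over those that are available.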
To test your two recommender systems, provide function calls for each of the two functions. You can freely select the parameters of the function calls.

Question 3

Part 3 - Text analysis and ethics
In this part, you will be working with the reviews.csv file providing reviews for the listings, and more specifically, the 'comments' column.
Question 3a - Pointwise Mutual Information
In this question, you implement and apply the pointwise mutual information (PMI) metric, a word association metric introduced in 1992, to the Airbnb reviews. The purpose of PMI is to extract, from free text, pairs of words that tend to co-occur more often than expected by chance. For example, PMI('new', 'york') would give a higher score than PMI('new', 'car') because the chance of finding 'new' and 'york' together in text is higher than 'new' and 'car', despite 'new' being a more frequent word than 'york'. By extracting word pairs with a high PMI score in our reviews, we will be able to understand better how people feel and talk about certain items of interest (e.g., 'windows' or 'location').
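The formula for PMI (where x and y are two words) is:
\[ \mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)} \]
Here P(x, y) is the probability that x and y co-occur, and P(x) and P(y) are the probabilities of each word on its own.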
Watch this video to understand how to estimate these probabilities.
Your solution will involve the following steps:
1. (4 marks) Process the raw reviews, applying the following steps in this specific order:
a. Tokenize all reviews. Use nltk's word_tokenize method.
b. Part-of-speech (PoS) tagging: to be able to differentiate nouns from adjectives or verbs. Use nltk's pos_tag method.
c. Lower case: to reduce the size of the vocabulary.

What to implement: A function process_reviews(df) that takes as input the original dataframe and returns it with three additional columns: tokenized, tagged and lower_tagged, which correspond to steps a, b and c described above.

2. Starting from the output of step 1.c (tokenized, PoS-tagged and lower cased reviews), create a vocabulary of 'center' (the x in the PMI equation) and 'context' (the y in the PMI equation) words. Your vocabulary of center words will be the 1,000 most frequent NOUNS (words with a PoS tag starting with 'N'), and the context words will be the 1,000 most frequent words tagged as either VERB or ADJECTIVE (words with any PoS tag starting with either 'J' or 'V').

What to implement: A function get_vocab(df) which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives).
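A possible shape for get_vocab is sketched here. It assumes each cell of the lower_tagged column holds a list of (word, tag) tuples, and it adds a hypothetical parameter n (default 1,000) so that the function can be tried on tiny inputs; neither detail is fixed by the brief.

```python
from collections import Counter

import pandas as pd


def get_vocab(df, n=1000):
    center_counts, context_counts = Counter(), Counter()
    for tagged in df["lower_tagged"]:
        for word, tag in tagged:
            if tag.startswith("N"):            # nouns become center words
                center_counts[word] += 1
            elif tag.startswith(("J", "V")):   # adjectives and verbs become context words
                context_counts[word] += 1
    center_vocab = [word for word, _ in center_counts.most_common(n)]
    context_vocab = [word for word, _ in context_counts.most_common(n)]
    return center_vocab, context_vocab


# Two tiny pre-tagged "reviews" (invented for illustration).
reviews = pd.DataFrame({"lower_tagged": [
    [("big", "JJ"), ("room", "NN"), ("is", "VBZ"), ("clean", "JJ")],
    [("room", "NN"), ("bed", "NN"), ("big", "JJ")],
]})
center_vocab, context_vocab = get_vocab(reviews, n=1)
print(center_vocab, context_vocab)
```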
3. (8 marks) With these two 1,000-word vocabularies, create a co-occurrence matrix where, for each center word, you keep track of how often each of the context words co-occurs with it. Consider this short review with only one sentence as an example, where we want to get co-occurrences for verbs and adjectives for the center word restaurant:
'A big restaurant served delicious food in big dishes'

>>> {'restaurant': {'big': 2, 'served': 1, 'delicious': 1}}
What to implement: A function get_coocs(df, center_vocab, context_vocab) which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach.
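One of several defensible designs is sketched below: the context is the full review, and a center word that appears several times in one review is counted once for that review. Both choices are assumptions of this sketch, as is the lower_tagged layout (lists of (word, tag) tuples).

```python
from collections import Counter, defaultdict

import pandas as pd


def get_coocs(df, center_vocab, context_vocab):
    center_set, context_set = set(center_vocab), set(context_vocab)
    coocs = defaultdict(Counter)
    for tagged in df["lower_tagged"]:
        # Checking the tag as well as the vocabulary keeps a word that is frequent
        # both as a noun and as a verb from being counted in the wrong role.
        centers = {w for w, t in tagged if t.startswith("N") and w in center_set}
        contexts = Counter(w for w, t in tagged
                           if t.startswith(("J", "V")) and w in context_set)
        for center in centers:  # set: a repeated center word counts once per review
            coocs[center].update(contexts)
    return {center: dict(counts) for center, counts in coocs.items()}


# The example sentence from the text, already tokenized, tagged and lower cased.
example = pd.DataFrame({"lower_tagged": [[
    ("a", "DT"), ("big", "JJ"), ("restaurant", "NN"), ("served", "VBD"),
    ("delicious", "JJ"), ("food", "NN"), ("in", "IN"), ("big", "JJ"),
    ("dishes", "NNS"),
]]})
coocs = get_coocs(example, ["restaurant"], ["big", "served", "delicious"])
print(coocs)
```

A per-sentence or sliding-window context would give sparser but more local associations; the loop structure stays the same and only the unit iterated over changes.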

4. After you have computed co-occurrences from all the reviews, you should convert the co-occurrence dictionary into a pandas DataFrame. The DataFrame should have 1,000 rows and 1,000 columns.

What to implement: A function called cooc_dict2df(cooc_dict), which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, each column corresponds to one context word, and each cell holds the corresponding co-occurrence value. Some (x, y) pairs will never co-occur; you should have a 0 value for those cases.
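A compact way to do this conversion, shown as a sketch on an invented two-word dictionary:

```python
import pandas as pd


def cooc_dict2df(cooc_dict):
    # Outer keys become rows (center words), inner keys become columns (context words).
    df = pd.DataFrame.from_dict(cooc_dict, orient="index")
    # Pairs that never co-occur are missing from the dictionary; store them as 0.
    return df.fillna(0).astype(int)


toy_coocs = {
    "restaurant": {"big": 2, "served": 1},
    "room": {"clean": 3, "big": 1},
}
cooc_df = cooc_dict2df(toy_coocs)
print(cooc_df)
```

Words that never co-occur with anything do not appear in the dictionary at all, so obtaining the full 1,000 by 1,000 shape may additionally require reindexing against the two vocabularies.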

5. Then, convert the co-occurrence values to PMI scores.

What to implement: A function cooc2pmi(df) that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts.
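The conversion can be sketched as below, estimating each probability as a count divided by the total of all counts in the matrix. The base-2 logarithm and the decision to leave never-co-occurring pairs at negative infinity are assumptions of this sketch.

```python
import numpy as np
import pandas as pd


def cooc2pmi(df):
    total = df.to_numpy().sum()
    p_xy = df / total                 # joint probability of each (center, context) pair
    p_x = df.sum(axis=1) / total      # marginal probability of each center word (rows)
    p_y = df.sum(axis=0) / total      # marginal probability of each context word (columns)
    expected = pd.DataFrame(np.outer(p_x, p_y), index=df.index, columns=df.columns)
    # log2(0) is -inf for pairs that never co-occur; silence the NumPy warning.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log2(p_xy / expected)


counts = pd.DataFrame([[2, 0], [0, 2]], index=["a", "b"], columns=["x", "y"])
pmi = cooc2pmi(counts)
print(pmi)
```

In this toy matrix P(a, x) = 0.5 and P(a) = P(x) = 0.5, so PMI(a, x) = log2(0.5 / 0.25) = 1.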

6. Finally, implement a method to retrieve context words with highest PMI score for a given center word.

What to implement: A function topk(df, center_word, N=10) that takes as input: (1) the DataFrame generated in step 5, (2) a center word (a string like 'towels'), and (3) an optional named argument called N with a default value of 10; and returns a list of N strings, in descending order of their PMI score with the center_word. You do not need to handle cases for which the word center_word is not found in df.
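Retrieval then reduces to sorting one row of the PMI DataFrame. A minimal sketch, with invented scores:

```python
import pandas as pd


def topk(df, center_word, N=10):
    # nlargest returns the N highest values of the row in descending order.
    return df.loc[center_word].nlargest(N).index.tolist()


toy_pmi = pd.DataFrame([[0.5, 2.0, 1.0]], index=["towels"],
                       columns=["clean", "fresh", "soft"])
print(topk(toy_pmi, "towels", N=2))
```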

Question 3b - Ethical considerations
Local authorities in tourist hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent to, among other goals, ensure that rent prices stay fair for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends that a new host set the price of their flat above the limit established by the local government's official regulations. Upon inspection, you realize that the inflated recommendation comes from many apartments in the area only being offered during an annual event which brings many tourists, and which causes prices to rise.

In this context, critically reflect on the compliance of this recommender system with one of the five actions outlined in the UK's Data Ethics Framework. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three key principles outlined in the Framework, namely transparency, accountability and fairness. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support.

Your report should be between 500 and 750 words long.

Attachment:- Python_modules.rar
