Everyone is freaking out over San Francisco’s astronomically high rent prices right now, when Seattle real estate isn’t that far behind.
I was walking down the street in the University District near the University of Washington when I saw construction going on. And then more construction. And then MORE construction down the block. Ridiculous. According to a half-torn flyer, the prices for these new apartments opening in 2015 were around $1,300 per bedroom for a two-bedroom apartment! I quickly went home and started trying to figure out how fast rent was rising in Seattle.
One fun way to do this was to pick up a project that I had put off for school back in November. Since Craigslist has reliably been the best place to find apartments, I wrote a script using Scrapy to grab apartment listings from Seattle craigslist and filtered them for zipcodes within the Seattle boundaries.
I manually ran the script for about a week and, after filtering out duplicate posts and duplicate IDs, I got 6,000 individual listings within the metro region and 2,400 unique listings within the city of Seattle.
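A minimal sketch of that clean-up step, not the actual script: it assumes each scraped listing is a dict with hypothetical `id`, `title`, `price`, and `zipcode` fields, treats a repost as the same normalized title at the same price, and uses a tiny illustrative subset of Seattle zipcodes.

```python
# Illustrative subset only; a real run would list every Seattle zipcode.
SEATTLE_ZIPCODES = {"98112", "98118", "98144"}


def dedupe_listings(listings):
    """Keep the first listing seen for each post ID and each (title, price) pair."""
    seen_ids = set()
    seen_posts = set()
    unique = []
    for listing in listings:
        # Assumption: reposts share a title (ignoring case/whitespace) and a price.
        post_key = (listing["title"].strip().lower(), listing["price"])
        if listing["id"] in seen_ids or post_key in seen_posts:
            continue
        seen_ids.add(listing["id"])
        seen_posts.add(post_key)
        unique.append(listing)
    return unique


def within_seattle(listings, zipcodes=SEATTLE_ZIPCODES):
    """Keep only listings whose zipcode is in the given set."""
    return [listing for listing in listings if listing.get("zipcode") in zipcodes]


# Made-up sample data for illustration.
sample = [
    {"id": "1", "title": "Sunny 2br in Columbia City", "price": 1800, "zipcode": "98118"},
    {"id": "1", "title": "Sunny 2br in Columbia City", "price": 1800, "zipcode": "98118"},
    {"id": "2", "title": "sunny 2BR in Columbia City ", "price": 1800, "zipcode": "98118"},
    {"id": "3", "title": "1br in Bellevue", "price": 1300, "zipcode": "98004"},
]

unique = dedupe_listings(sample)
in_city = within_seattle(unique)
print(len(unique), "unique listings,", len(in_city), "within the city")
```

Matching on title and price is only one possible definition of a duplicate post; a stricter or looser rule would change the counts.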
Hopefully by next year I can finish pipelining the data straight into a database, with a bash script that runs automatically every day, so I can track prices over time. I was halfway through it when midterms arrived at the door. Stupid school.
For a tutorial on scraping craigslist with Scrapy and regression analysis – Practical Web Scraping with Scrapy
First I plotted all of the Seattle apartment listing prices.
And here are the average prices per number of bedrooms.
Looks pretty reasonable. Now let’s look at the Seattle craigslist median apartment prices in comparison to San Francisco’s median prices.
Ouch, well maybe they are pretty far ahead. But to be fair, Seattle’s neighborhoods probably vary more in cost than San Francisco’s, with its huge concentration of very expensive apartments in the Mission and north of Bayview.
Now let’s say that we want to see how much the same apartment would cost in different neighborhoods of Seattle. There isn’t nearly enough data to get accurate average prices for 1 bedroom or 2 bedroom apartments in every neighborhood, so let’s try a different method for finding how prices vary by neighborhood. Instead of just computing the average price of a 1 bedroom apartment in each neighborhood, we can run a regression model to see how the neighborhood affects price while holding other factors, such as square footage and number of bedrooms, constant.
In this case, we are going to use zipcodes and label them as factors in our model. Running a regression model on number of beds, number of baths, neighborhood zipcode, and square footage, you can see how much each zipcode/neighborhood matters for the base price.
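Here is a rough sketch of what treating zipcodes as factors can look like, not the actual model code. It assumes one 0/1 indicator column per zipcode and no global intercept, so each zipcode coefficient reads as a base price, and it fits ordinary least squares through the normal equations in plain Python. The listings are synthetic, generated without noise from made-up coefficients (300 per bed, 100 per bath, 70 per hundred square feet, base prices of 800 and 500).

```python
def design_row(listing, zipcodes):
    """Numeric features followed by one 0/1 indicator per zipcode (no intercept)."""
    indicators = [1.0 if listing["zipcode"] == z else 0.0 for z in zipcodes]
    return [listing["beds"], listing["baths"], listing["sqft"] / 100.0] + indicators


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(listings):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    zipcodes = sorted({listing["zipcode"] for listing in listings})
    x = [design_row(listing, zipcodes) for listing in listings]
    y = [listing["price"] for listing in listings]
    k = len(x[0])
    xtx = [[sum(row[i] * row[j] for row in x) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(x, y)) for i in range(k)]
    names = ["beds", "baths", "sqft_hundreds"] + ["zip_" + z for z in zipcodes]
    return dict(zip(names, solve_linear_system(xtx, xty)))


# Synthetic listings: price = 300*beds + 100*baths + 70*(sqft/100) + zipcode base.
listings = [
    {"beds": 1, "baths": 1, "sqft": 600, "zipcode": "98112", "price": 1620},
    {"beds": 2, "baths": 1, "sqft": 850, "zipcode": "98112", "price": 2095},
    {"beds": 2, "baths": 2, "sqft": 900, "zipcode": "98112", "price": 2230},
    {"beds": 1, "baths": 1, "sqft": 550, "zipcode": "98118", "price": 1285},
    {"beds": 2, "baths": 1, "sqft": 800, "zipcode": "98118", "price": 1760},
    {"beds": 3, "baths": 2, "sqft": 1200, "zipcode": "98118", "price": 2440},
]

coefficients = fit_ols(listings)
for name, value in coefficients.items():
    print(name, round(value, 2))
```

With real, noisy listings a statistics package would be the more sensible tool, since it also reports standard errors and significance for each zipcode.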
The bar graph above shows how much each location would cost before factoring in any extra beds, baths, or square footage. Basically, if you want to live downtown, prepare to fork over 500+ more dollars per month than you would for the exact same apartment anywhere in North Seattle. If we add 95% confidence intervals to the values, we can see which neighborhoods have more or less range in the base price estimate.
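For reference, a 95% interval around a regression coefficient is commonly approximated as the estimate plus or minus 1.96 standard errors. A minimal sketch, assuming that normal approximation: the 808 is the Capitol Hill base price quoted further down, while both standard errors are made up purely to show a narrow and a wide interval.

```python
def confidence_interval(estimate, standard_error, z=1.96):
    """Normal-approximation interval: estimate +/- z * standard error."""
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width


# Hypothetical standard errors; the post does not report the real ones.
narrow = confidence_interval(808, 40)
wide = confidence_interval(808, 120)
print("narrow:", narrow)
print("wide:", wide)
```

With few listings in a zipcode, a t-distribution quantile would be more appropriate than 1.96.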
We can see here that zipcodes that encompass more neighborhoods generally have larger confidence intervals, as the differences in prices between nearby neighborhoods are greater. For example, the 98144 zipcode of Seattle’s Central District encompasses parts of SODO, Mt. Baker, North Beacon Hill, and South Downtown. For neighborhoods in the downtown area, there is less of a range because the zipcodes cover smaller areas and correspond to their neighborhoods more accurately.
The biggest ones seem to be Madison Park/Montlake and Columbia City. The zipcode of 98112 stretches from the west side of Lake Washington, where Madison Park and the Arboretum lie, all the way to Volunteer Park and some of the nicer areas of Capitol Hill. There’s bound to be more variance when a zipcode covers such a large area with multiple neighborhoods, especially ones harboring nice restaurants such as Harvest Vine, which my girlfriend tells me is quite classy and spectacularly organic.
Now looking at the zipcode of 98118, we can see that it’s actually made up of quite a number of different neighborhoods as well. For the most part, if we were to look directly at the map, we would probably classify the neighborhood as mainly Rainier Valley. As most people know, Rainier Valley is one of the poorest neighborhoods in Seattle today. So why is the zipcode right there next to West Seattle and Magnolia in price? Well, there are a few reasons, actually. Taking a look at the data, most of the listings had Columbia City in their titles. Columbia City is going through what Wikipedia describes as “gentrification” and has become a “relatively trendy neighborhood” in the last couple of years. The large range in the confidence interval could then reflect the mix of neighborhoods, from low-income Rainier Beach to the high-end houses that overlook Seward Park, along with many now-expensive Columbia City townhomes and apartments. A more interesting thought: if Seattle’s rate of expansion and growth starts matching San Francisco’s soon, could Rainier Valley become the next Mission District?
If we take a look at the rest of the factors in the model, we can get a decent prediction of how much a given apartment would cost.
So let’s say that I want to estimate the price of an apartment in Capitol Hill. I will take 900 square feet of space, two bedrooms because I am still too scared to live alone, and two bathrooms just because of personal bathroom issues.
Price = Base Price of Capitol Hill (808) + Square Footage in hundreds (9 * 69.68) + 2 * 322 + 2 * 107 = 2,293 dollars.
Not too bad.
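The arithmetic above can be wrapped in a small function. This is only a sketch: the coefficient values come from the formula in the post, but reading 322 as the per-bedroom term and 107 as the per-bathroom term is an assumption based on the order in which they appear, and the parameter names are made up.

```python
def predict_rent(base_price, sqft, beds, baths,
                 per_100_sqft=69.68, per_bed=322.0, per_bath=107.0):
    """Monthly rent estimate from the linear model's coefficients.

    Assumption: 322 applies per bedroom and 107 per bathroom.
    Square footage is scaled to hundreds of square feet, as in the post.
    """
    return (base_price
            + (sqft / 100.0) * per_100_sqft
            + beds * per_bed
            + baths * per_bath)


# Capitol Hill example from the post: 900 sq ft, two bedrooms, two bathrooms.
price = predict_rent(base_price=808, sqft=900, beds=2, baths=2)
print(round(price))  # 2293
```

Swapping in another neighborhood’s base price is enough to compare the same apartment across zipcodes.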
Let’s shift gears again and try to predict the price by adding in a couple more factors. Obviously not all apartments are so easily calculable, because there are a lot more considerations when we go out to find a new place to live. This one was below market value according to our model and honestly looks like a steal when looking at that picture of the rooftop.
But how do we actually incorporate additional amenities, such as “rooftop” or maybe the word “penthouse” in a posting? When we look for apartments, we obviously won’t just check whether a place is within a neighborhood and price range with a specified number of beds and bathrooms. We need to look at pictures! Lots of apartments are dirt cheap because, even if they’re huge, they might be old or disgusting. Since I haven’t yet written a computer vision algorithm to detect whether or not an apartment looks awesome, we can try using the number of pictures in a posting and the word count of the posting to see if these factors are significant enough.
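A minimal sketch of pulling those extra factors out of a posting, assuming the body text and a list of image URLs have already been scraped. The function name, the feature names, and the sample posting are all made up; the keyword flags are just one simple way to capture words like “rooftop” or “penthouse”.

```python
KEYWORDS = ("rooftop", "penthouse")


def posting_features(body, image_urls):
    """Count pictures, non-blank lines, and words, and flag amenity keywords."""
    lines = [line for line in body.splitlines() if line.strip()]
    lowered = body.lower()
    features = {
        "num_pictures": len(image_urls),
        "num_lines": len(lines),
        "word_count": len(body.split()),
    }
    for keyword in KEYWORDS:
        # 1 if the keyword appears anywhere in the body, else 0.
        features["has_" + keyword] = int(keyword in lowered)
    return features


# Made-up posting for illustration.
example = posting_features(
    "Bright 2br with rooftop deck.\n\nWalk to light rail.\nNo smoking.",
    ["img1.jpg", "img2.jpg", "img3.jpg"],
)
print(example)
```

Each of these values can then be added as another column in the regression alongside beds, baths, square footage, and zipcode.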
Re-running the regression with them added and cross-validating the model to check for overfitting, I found that it is a slightly better predictor of price than the original model, but not by much. [DATA IN UPCOMING TUTORIAL]
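The cross-validation setup itself isn’t shown here, so the following is only a sketch of the general idea: hold out one fold at a time, fit on the rest, and compare models by their error on the held-out listings. It uses synthetic square footage and prices and two deliberately simple stand-in models (a mean-only baseline and a one-variable line), not the models from this post. Folds are contiguous for simplicity; shuffling first is the usual practice.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mae(xs, ys, fit, k=5):
    """Mean absolute error over all held-out points."""
    total = 0.0
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit(train_x, train_y)
        total += sum(abs(predict(xs[i]) - ys[i]) for i in fold)
    return total / len(xs)


def fit_mean(train_x, train_y):
    """Baseline: always predict the mean training price."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y


def fit_line(train_x, train_y):
    """Least-squares line through one feature."""
    n = len(train_x)
    mean_x, mean_y = sum(train_x) / n, sum(train_y) / n
    sxx = sum((x - mean_x) ** 2 for x in train_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


# Synthetic data: price is roughly 600 + 1.2 * sqft plus a small wiggle.
sqft = [500, 650, 700, 820, 900, 1000, 1100, 1250, 1300, 1500]
wiggle = [40, -30, 10, -20, 35, -15, 25, -40, 5, -10]
prices = [600 + 1.2 * x + w for x, w in zip(sqft, wiggle)]

mae_mean = cross_val_mae(sqft, prices, fit_mean)
mae_line = cross_val_mae(sqft, prices, fit_line)
print("baseline MAE:", round(mae_mean, 1), "line MAE:", round(mae_line, 1))
```

A model that only looks better on the data it was fit to, but not on the held-out folds, is a sign of overfitting.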
Each picture added to a craigslist posting adds 9 more dollars to the price, and each line in the body of a posting adds 2.83 dollars. Each original base price has now changed, as have the other factors. But does this mean that all realtors should start posting as many pictures as they can and writing lines and lines of garbage in their postings in order to jack up their prices? Absolutely not. It just shows that listings with more pictures are generally priced higher. Maybe that’s because people whose apartments are cheaper do not exactly want to post pictures showing why their apartments are so much cheaper.
What’s my future goal for this project? I am not sure, but I know I want to start collecting more data. The awesome thing about data analysis and data science is that you start finding things that lead to more questions. I think I can list a couple of them already. One is that I don’t expect the price of an apartment in June to match the price of the same apartment in December. By collecting data over the course of a year, I can see how the date or month affects the price and notice whether there are seasonal trends. My sister had a hard time finding a cheap place in Seattle right after school ended last June, when it seems like there may be better times to start a lease with less demand (especially when leases are yearly).
With more data, the predictions should also get more accurate, and more text mining techniques can be used to find out whether specific words are associated with more expensive apartments. Computer vision software in Python is advancing at a rapid rate and could possibly detect brightly lit rooms or large window views. More data can also support a recommendation system or an alert system for a great deal that suddenly gets posted.
I will definitely update this post sometime in the next year. But for now, if you guys have any questions or can point out some things I did wrong, which I probably did, PLEASE leave a comment or email me. I am always interested in different opinions and different ways to improve!
This is a great blog post with a fascinating problem. Going to subscribe to hear more info. If you are looking for additional projects let me know. Would love to touch base and see if there is way you can help us out.
Reblogged this on Seattle Property Investment and commented:
I should run this for my house 😉
That little green pocket surrounded by reds and oranges….
Great project! Kudos on the creativity. There’s actually a startup that’s a few years along in doing something similar (predicting fair market rental prices and identifying good deals): Kwelia.com. I’ve used them to find the best apartment deals the last two times I moved between cities in the Northeast US. Perhaps their site will lend some insight on potential uses of the technology and where to go next. Keep up the meaningful work!
Is there a way to look at archived or expired craigslist listings to see if you can match prices for the same buildings over time? One would suspect a building manager turning to CL first each time? Chances are they’ll post the address or cross street in each posting and copy/paste the building info changing only the details of the unit available.
It would be interesting to factor in transit (specifically the light rail, as it is further constructed in some of the neighborhoods you cover) and see how proximity to transit (via keywords “light rail station” “bus stop” etc) affects cost.
I live in a rental townhome in West Seattle and there are at least 7 or 8 new apartment buildings underconstruction right now, with a few of them pre-leasing right now. It would be interesting to include those, because it feels like there is an impending glut of housing about to hit the market here!
“We can see here that for zipcodes that generally encompass more neighborhoods, their confidence intervals are larger as the differen in prices in nearby neighborhoods are greater.”
just a typo to fix. I’m assuming you mean for “differen” to be difference. feel free to delete this comment if it doesn’t add any value.
Great post! Interested to see what your continued analysis turns up. Keep up the good work!
As a side note, The Stranger has ran a good number of stories on housing prices and such in the Seattle area. If you’re interested in having a guest piece printed on your findings once they’re fully developed, there’s a fair shot that they’d be interested in it.
Nice data and analysis! Have you tried using Tableau? You could get some even cooler visualizations from your data. Tableau Public is free: https://public.tableau.com/s/
I think the intro is misplaced. Building new units is the only way to combat rising prices. The higher end renters will simply filter up creating more openings for middle and lower end renters.
Good day! I could have sworn I’ve visited your blog before but after going through many of the articles I realized it’s new to me. Regardless, I’m definitely happy I discovered it and I’ll be book-marking it and checking back frequently!
Dude, great work! Keep it up.