
Web Scraping OrNot - Part 1

OrNot is my favorite cycling apparel brand, but do others think the same thing? Let's find out with the help of web scraping.


Introduction

I’ve got two Python projects under my belt. The first looked at crime in Los Angeles and the second analyzed my cycling performance. Each was a progressive step up: I didn’t know Python before starting the first project, and the second had me accessing my first API. With a solid foundation in place, I wanted to move another step further – web scraping.

If you’ve made it this far in my portfolio, I think it’s fair to say I like cycling. For this project, I decided to scrape customer reviews from my favorite cycling apparel company, OrNot. They’re based in California, most of their products are made in the US, no massive logos adorn their clothes, and their jerseys are downright comfortable. So here we go!

Web Scraping

I’m going to be using a library I’ve never used before – Selenium. It lets me interact with the webpage like a normal user: closing popups, scrolling down, clicking to the next page, and so on. I began by collecting the URLs of all the products I would need to scrape. This was a really simple loop once I figured out there were six pages of products.

# Collecting all the urls I'll need to scrape
from selenium import webdriver
from selenium.webdriver.common.by import By

product_urls = []

driver = webdriver.Chrome()  # Initialize the driver once (you may need to adjust the driver path)

for page_number in range(1, 7):  # six pages of products
    url = f'https://www.ornotbike.com/collections/mens?page={page_number}'
    driver.get(url)

    # Getting links for all products
    print('Getting all product links for Page', page_number)

    find_all_products = driver.find_elements(By.TAG_NAME, 'a')

    for product in find_all_products:
        if product.get_attribute('class') == 'product-link':
            product_url = product.get_attribute('href')
            if product_url not in product_urls:
                product_urls.append(product_url)
                print(product_url)

driver.quit()  # close this browser; the scraping loop below opens its own

With all of the product URLs collected, I now needed to construct another loop to iterate over each URL. For each URL I wanted to collect the following information:

  1. user – person who submitted review
  2. review_post_date – date the review was submitted
  3. location_of_review – user’s location
  4. star_ratings – user’s grade of the product on a scale of 1-5
  5. review_headers – title of the review
  6. review_body – actual review
  7. fit_rating – user’s grade of the product’s fit on a scale of 1-5
  8. products – the name of the product the user was reviewing
  9. size_ordered – size of the product the reviewer purchased
  10. reviewers_height_and_weight – reviewer’s height and weight, if provided

I was able to inspect the webpage and tell Selenium how to find each element. If it couldn’t find an element (if someone didn’t provide their height or weight, for example), it would store N/A instead. Once the elements were found, the program clicked through to the next page of reviews for that product, and when there were no more pages it moved on to the next URL. After completing this loop for each URL, I saved the results to a .csv.

import time
import pandas as pd
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

# Function to scrape reviews from a given product URL
def scrape_reviews(product_url):

    driver.get(product_url)
    time.sleep(5)

    try:
        find_popup = driver.find_element(By.CSS_SELECTOR, 'button.needsclick.klaviyo-close-form.kl-private-reset-css-Xuajs1')
        find_popup.click()
        time.sleep(3)
    except Exception as e:
        print("Error while closing popup:", e)

    # Initialize lists to store the extracted information
    users = []
    review_post_date = []
    location_of_review = []
    star_ratings = []
    review_headers = []
    review_body = []
    fit_rating = []
    products = []
    size_ordered = []
    reviewers_height_and_weight = []

    while True:

        # Find all elements with the class 'stamped-review'
        review_elements = driver.find_elements(By.CLASS_NAME, 'stamped-review')

        # Loop through each 'stamped-review' element and extract the information
        for review in review_elements:

            # Extract the username from the 'strong' tag with the class 'author'
            try:
                username_element = review.find_element(By.CLASS_NAME, 'author')
                username = username_element.text
                users.append(username)
            except:
                users.append('N/A')

            # Extracting date the review was posted if available
            try:
                date_element = review.find_element(By.CLASS_NAME, 'created')
                date = date_element.text
                review_post_date.append(date)
            except:
                review_post_date.append('N/A')

            # Extracting location of review if available
            try:
                location_element = review.find_element(By.CSS_SELECTOR, '.review-location')
                location = location_element.text
                location_of_review.append(location)
            except:
                location_of_review.append('N/A')

            # Extracting star rating if available
            try:
                star_rating_element = review.find_element(By.CSS_SELECTOR, '.stamped-starratings')
                star = star_rating_element.get_attribute('data-rating')
                star_ratings.append(star)
            except:
                star_ratings.append('N/A')

            # Extracting heading of review if available
            try:
                header_element = review.find_element(By.CLASS_NAME, 'stamped-review-header-title')
                header = header_element.text
                review_headers.append(header)
            except:
                review_headers.append('N/A')

            # Extracting body of review if available
            try:
                review_body_element = review.find_element(By.CLASS_NAME, 'stamped-review-content-body')
                body = review_body_element.text
                review_body.append(body)
            except:
                review_body.append('N/A')

            # Extracting size ordered if available
            try:
                size_element = review.find_element(By.CSS_SELECTOR, 'div.stamped-review-variant')
                size = size_element.text
                size_ordered.append(size)
            except:
                size_ordered.append('N/A')

            # Extracting product type if available
            try:
                product_element = review.find_element(By.CSS_SELECTOR, 'a[href*=\'stamped.io/go/\']')
                product = product_element.text
                products.append(product)
            except:
                products.append('N/A')

            # Extracting fit rating if available
            try:
                fit_element = review.find_element(By.CLASS_NAME, 'stamped-review-option-scale')
                fit = fit_element.get_attribute('data-value')
                fit_rating.append(fit)
            except:
                fit_rating.append('N/A')

            # Extracting height and weight of user if available
            try:
                height_weight_element = review.find_element(By.CSS_SELECTOR, 'li[data-value="what-is-your-height-and-weight"]')
                span_element = height_weight_element.find_element(By.CSS_SELECTOR, 'span[data-value]')
                height_weight = span_element.text
                reviewers_height_and_weight.append(height_weight)
            except:
                reviewers_height_and_weight.append('N/A')

        # Check if the "Next page" button is clickable
        try:
            next_button = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, 'li.next > a[aria-label="Next page"]'))
            )

            # Click the "Next page" button and wait for the new page to load
            driver.execute_script("arguments[0].scrollIntoView({block: 'center', inline: 'center', behavior: 'instant'});", next_button)
            driver.execute_script("arguments[0].click();", next_button)
            time.sleep(5)
        except Exception as e:
            print("Error while clicking 'Next page' button:", e)
            break

    return users, review_post_date, location_of_review, star_ratings, review_headers, review_body, fit_rating, products, size_ordered, reviewers_height_and_weight

all_users = []
all_review_post_date = []
all_location_of_review = []
all_star_ratings = []
all_review_headers = []
all_review_body = []
all_fit_rating = []
all_product = []
all_sizes_ordered = []
all_reviewers_height_and_weight = []

for url in product_urls:
    users, review_post_date, location_of_review, star_ratings, review_headers, review_body, fit_rating, products, size_ordered, reviewers_height_and_weight = scrape_reviews(url)

    all_users.extend(users)
    all_review_post_date.extend(review_post_date)
    all_location_of_review.extend(location_of_review)
    all_star_ratings.extend(star_ratings)
    all_review_headers.extend(review_headers)
    all_review_body.extend(review_body)
    all_fit_rating.extend(fit_rating)
    all_product.extend(products)
    all_sizes_ordered.extend(size_ordered)
    all_reviewers_height_and_weight.extend(reviewers_height_and_weight)

    print('Users count:', len(all_users))
    print('Date count:', len(all_review_post_date))
    print('Location count:', len(all_location_of_review))
    print('Star count:', len(all_star_ratings))
    print('Header count:', len(all_review_headers))
    print('Body count:', len(all_review_body))
    print('Fit count:', len(all_fit_rating))
    print('Product count:', len(all_product))
    print('Size count:', len(all_sizes_ordered))
    print('Height weight count:', len(all_reviewers_height_and_weight))

data1 = {
    'users': all_users,
    'date': all_review_post_date,
    'location': all_location_of_review,
    'star': all_star_ratings,
    'headers': all_review_headers,
    'body': all_review_body,
    'fit': all_fit_rating,
    'product': all_product,
    'size': all_sizes_ordered,
    'height and weight': all_reviewers_height_and_weight,
}

df1 = pd.DataFrame(data1)
df1.to_csv('ornotdata_trouble1.csv', encoding = 'utf-8-sig')

This proved to be a rather finicky process. Occasionally the program would break because a pop-up appeared and needed to be dismissed. Other times I think Selenium had some difficulty knowing where to scroll to. It also collected, for reasons I’m still unsure of, hundreds of rows of completely null data. Ultimately, though, I was able to collect every review for every item listed in the men’s collection on the OrNot website.
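
For anyone fighting the same pop-up problem, here’s a minimal sketch of the kind of defensive helper I could have called before each scraping step. It targets the same Klaviyo close button as the script above, but the helper itself is my own illustration, not part of the final script.

# A sketch of a defensive pop-up dismisser (illustrative, not in the final script).
# Instead of crashing, it quietly ignores the case where no pop-up is present.
from selenium.webdriver.common.by import By

def dismiss_popup_if_present(driver):
    try:
        popup_close = driver.find_element(
            By.CSS_SELECTOR,
            'button.needsclick.klaviyo-close-form.kl-private-reset-css-Xuajs1')
        popup_close.click()
    except Exception:
        pass  # no pop-up this time -- carry on scraping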

Data Cleaning

So began the cleaning process. The first and most obvious step was to remove any rows that were completely null. I also wanted to make sure I didn’t keep any duplicate entries, so I dropped those as well. From there I went through star_ratings, review_headers, review_body, and products looking for nulls, and dropped rows where they turned up (except for null headers, which turned out to be legitimate). The main goal of this project is to analyze each of these categories, and rows missing them are essentially worthless. All told, after the collection and cleaning process, I had 5,285 reviews to work with.

import pandas as pd

# Loading the scraped reviews back in (assuming the CSV saved in the scraping step)
ornot_df = pd.read_csv('ornotdata_trouble1.csv', index_col = 0)

# Dropping any rows with all null data
ornot_df.dropna(how = 'all', inplace = True)
# print(ornot_df.info())

# Dropping any duplicate entries
ornot_df.drop_duplicates(inplace = True)
# print(ornot_df.info())

# We now have a username and date for every row, but we're still missing data
# from important attributes like star, headers, body, and product.
# I'm going to check these out one by one.
null_star = ornot_df[ornot_df['star'].isnull()]
# print(null_star)

# These only have username, date, and location. Dropping these rows
ornot_df.dropna(subset = ['star'], inplace = True)
# print(ornot_df.info())

# Checking the null headers
null_header = ornot_df[ornot_df['headers'].isnull()]
# print(null_header)

# Okay, these are fine. I didn't know you could leave a review without a header, but you can. Keeping.

# Checking any null reviews
null_review = ornot_df[ornot_df['body'].isnull()]
# print(null_review)

# These aren't worth keeping. Dropping
ornot_df.dropna(subset = ['body'], inplace = True)
# print(ornot_df.info())

# Lastly, checking the product type that was reviewed
null_product = ornot_df[ornot_df['product'].isnull()]
# print(null_product)

# It'll be hard to run analysis if I don't know the product getting reviewed. Dropping
ornot_df.dropna(subset = ['product'], inplace = True)
# print(ornot_df.info())

Data Analysis

I began with some really simple data exploration.

import matplotlib.pyplot as plt

# Average star rating for all reviews
average_star_rating = ornot_df['star'].mean()
print('Average star rating:', average_star_rating)

# Plotting all star ratings
star_count = ornot_df['star'].value_counts()
star_count.plot(kind = 'bar', figsize = (12, 8), color = 'royalblue')
plt.title('Star Rating Distribution', fontweight = 'bold')
plt.xlabel('Star Rating')
plt.ylabel('Number of Reviews')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
Star rating distribution of OrNot products

Over all reviews, OrNot averaged 4.86 stars out of 5, which is really impressive. I’m glad I’m not alone in my love for their products.

The fit of their clothing was just as spot on.

# Average fit of all products
average_fit = ornot_df['fit'].mean()
print('Average fit rating:', average_fit)

# Fit distribution
fit_distribution = ornot_df['fit'].value_counts()
order = [1, 2, 3, 4, 5]
fit_distribution = fit_distribution.reindex(order)

ax = fit_distribution.plot(kind = 'bar', figsize = (12, 8), color = 'royalblue')
ax.set_xlabel('Fit Rating')
ax.set_ylabel('Number of Reviews')
ax.set_title('Fit Rating Distribution', fontweight = 'bold')
new_x_labels = ['small', 'small-ish', 'perfect', 'big-ish', 'big']
ax.set_xticklabels(new_x_labels, rotation = 0)
plt.tight_layout()
plt.show()
Fit rating distribution of OrNot products

The average fit was 2.90, with 3 being a perfect fit. So if anything, their clothes run a bit snug, but not by much.

I also wanted to see where most of these reviews were coming from.

# Where are reviewers from?
location_count = ornot_df['location'].value_counts()
# print('Total Locations: ', location_count.sum())

Nearly all of them were from the US, at 94.98%.
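
For reference, that share comes straight out of a normalized value count – a quick sketch, assuming the scraped location strings report the United States as 'US':

# Share of reviews by country -- a quick sketch, assuming the site
# reports United States locations as 'US'
location_share = ornot_df['location'].value_counts(normalize = True) * 100
print(location_share.round(2).head())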

We should also check out how many reviews they’ve gotten each year.

# How many products reviewed by year
ornot_df['date'] = pd.to_datetime(ornot_df['date'])  # dates come back from the CSV as strings
ornot_df['year'] = ornot_df['date'].dt.year
product_count_by_year = ornot_df['year'].value_counts().sort_index()
product_count_by_year.plot(kind = 'bar', figsize = (12, 6), color = 'royalblue')
plt.title('Number of Products Reviewed by Year', fontweight = 'bold')
plt.xlabel('Year')
plt.ylabel('Number of Products Reviewed')
plt.xticks(rotation=0, fontsize=10)
plt.tight_layout()
plt.show()
Number of reviews per year for OrNot

It’s interesting to see the explosion in reviews in 2020, followed by another big year in 2021, before things started to taper off in 2022. I know cycling-related sales really took off across the whole industry in 2020 thanks to the pandemic. I remember it was really hard to get certain mechanical parts, and it’s cool to see that OrNot probably benefited from the influx of new cyclists as well. I’d love to know how closely the number of reviews correlates with their actual sales figures.

I’m also curious to see which of their products are the most reviewed, but there’s a wrinkle: the two most reviewed products are House Bib Shorts – Black and House Bib Shorts – Stone Blue, which are the same product in different colors. While it’s good to know the black has received more reviews than the stone blue, I want a more macro view. Which product categories (jerseys, bibs, jackets, shirts, etc.) get the most reviews?

import re

# How many bought of each product
ornot_df['product'] = ornot_df['product'].str.lower()
product_count = ornot_df['product'].value_counts()
print(product_count)

# Grouping products by broader category i.e. jersey, bib, sock...
grouping_products = {
    'Bibs/Tights': [r'bib', r'leg warmer'],
    'Jerseys': [r'jersey', r'base layer'],
    'Jackets/Vests': [r'jacket', r'vest'],
    'Shirts/Pullovers': [r'shirt', r'pullover'],
    'Shorts/Pants': [r'mission', r'boxer'],
    'Socks/Caps/Hat/Gloves': [r'sock', r'cap', r'hat', r'shoe', r'glove', r'beanie', r'neck'],
#   'Other': [r'dynaplug', r'gift', r'cygolite', r'belt', r'tool', r'topeak', r'bag', r'kom']
}

def map_product_category(product):
    product = product.lower()
    for category, key_words in grouping_products.items():
        for key_word in key_words:
            if re.search(key_word, product):
                return category
    return 'Other'

ornot_df['product_category'] = ornot_df['product'].map(map_product_category)
product_count_by_category = ornot_df['product_category'].value_counts()
product_count_by_category.plot(kind = 'bar', figsize = (12, 8), color = 'royalblue')
plt.xlabel('Product Category')
plt.ylabel('Number of Products')
plt.title('Number of Products Reviewed by Category', fontweight = 'bold')
plt.xticks(fontsize=10, rotation=45)
plt.tight_layout()
plt.show()
Number of reviews by product category

After grouping their products into broader categories, we find that jerseys get the most reviews, followed by bibs/tights and then jackets/vests. I was honestly a bit surprised to find that the Other category, which includes things like tools, lights, and bar bags, had more reviews than socks/caps/hats/gloves. Maybe this is because OrNot gives away free mini bar bags on first orders over $99? Something to investigate in the future, perhaps.
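
If I do dig into that bar-bag theory, the first step would be peeking at what actually lands in the Other bucket – a quick sketch using the product_category column built above:

# Peeking at what the 'Other' bucket actually contains -- a sketch for
# a future follow-up on the bar-bag hypothesis
other_products = ornot_df.loc[ornot_df['product_category'] == 'Other', 'product']
print(other_products.value_counts().head(10))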

Let’s see the average star rating for each of these product categories.

# Average star rating per product category
average_star_by_category = ornot_df.groupby('product_category')['star'].mean()
print(average_star_by_category)

And all categories have a rating of 4.81 or higher. No real surprise there given that all reviews for the men’s collection averaged 4.86.

Lastly, I’m curious about the sizing of the products. I’ll need to do some more grouping, as the sizes for jerseys are different from shorts, which are different from pants.

import matplotlib.ticker as ticker

# What sizes are most commonly reviewed
# Grouping sizes -- note the order matters: the more specific sizes
# (xx large, x large) are checked before plain 'large' so they don't
# get swallowed by the broader pattern
grouping_sizes = {
    'x small': [r'extra small', r'xs'],
    'small': [r'small', r'sm', r'mens small'],
    'medium': [r'medium', r'md', r'medium / synthetic', r'mens medium', r'medium / merino wool'],
    'xx large': [r'xx-large', r'xx - large', r'xx large', r'xxl', r'xx- large'],
    'x large': [r'extra large', r'x-large', r'xl', r'mens x-large'],
    'large': [r'large', r'lg', r'mens large']
}

def map_size_category(size):
    if pd.notna(size):
        size = size.lower()
        for category, key_words in grouping_sizes.items():
            for key_word in key_words:
                if re.search(key_word, size):
                    return category
        return size  # numeric sizes like '32' or '32x34' pass through unchanged
    else:
        return 'No size provided by reviewer'

ornot_df['size_category'] = ornot_df['size'].map(map_size_category)

pd.set_option('display.max_rows', None)

size_count = ornot_df['size_category'].value_counts()

# Most common sizes for shirts, shorts, and pants
cycling_clothes_sizes = ['x small', 'small', 'medium', 'large', 'x large', 'xx large']
short_sizes = ['28', '30', '32', '33', '34', '36', '38']
pant_sizes = ['28x32', '30x32', '30x34', '32x32', '32x34', '34x32', '34x34', '36x34', '38x34']
cycling_clothes_size_count = size_count[size_count.index.isin(cycling_clothes_sizes)]
short_size_count = size_count[size_count.index.isin(short_sizes)]
pant_size_count = size_count[size_count.index.isin(pant_sizes)]

# Plotting all sizes
plt.figure(figsize=(15, 5))

# Reindexing sizes so they graph better
cycling_clothes_size_count_sorted = cycling_clothes_size_count.reindex(cycling_clothes_sizes)
short_size_count_sorted = short_size_count.reindex(short_sizes)
pant_size_count_sorted = pant_size_count.reindex(pant_sizes)

# Cycling clothes sizes
plt.subplot(1, 3, 1)
cycling_clothes_size_count_sorted.plot(kind = 'bar', color = 'royalblue')
plt.title('Jerseys, Bibs, etc.')
plt.xlabel('Size')
plt.ylabel('Count')
plt.xticks(rotation=45)

# Short sizes
plt.subplot(1, 3, 2)
short_size_count_sorted.plot(kind = 'bar', color = 'seagreen')
plt.title('Shorts')
plt.xlabel('Size')
plt.xticks(rotation=45)

# Pant sizes
plt.subplot(1, 3, 3)
pant_size_count_sorted.plot(kind = 'bar', color = 'lightcoral')
plt.title('Pants')
plt.xlabel('Size')
plt.xticks(rotation=45)
plt.gca().yaxis.set_major_locator(ticker.MaxNLocator(integer=True))

plt.suptitle('Sizing Distributions by Category', fontweight = 'bold')
plt.tight_layout()
plt.show()
Size distribution for OrNot products

It seems most reviewers are around a medium regardless of the product category.
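
A quick way to confirm that medium wins within each category, and not just overall, would be a cross-tab of size against product category – a sketch built on the columns created above:

# Cross-tabbing size against product category -- a sketch to confirm
# medium is the most common size within each category, not just overall
size_by_category = pd.crosstab(ornot_df['size_category'], ornot_df['product_category'])
print(size_by_category.reindex(cycling_clothes_sizes))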

Conclusion

In this project I was able to scrape the OrNot website for a collection of all their customer reviews and perform some cursory analysis to get a feel for customer satisfaction. It turns out I’m not alone in really loving OrNot. No matter the product category, their quality and fit consistently rate really highly.

With all of the exploratory analysis done, I can move on to working with the actual review text. I’m saving that for another write-up, though, because this one is already long and took a lot of work. Part 2 coming soon!

Learning Take-Aways

Web scraping is great when APIs aren't available, but it's also a lot harder to implement correctly. I would much rather use an API, since they're designed to be easy to access, than hunt down individual elements on a webpage myself. It was fun learning some web scraping, but I can already see how it can get really challenging really fast.

Data collection and data cleaning are the hardest, most labor-intensive parts; the actual analysis is much more straightforward. I'll keep this in mind for future projects, especially ones with strict deadlines, so I can start collecting and processing the data as efficiently as possible.

The power to collect data from the internet is fun. It really is! It was honestly amazing to watch my program work its magic and flip through the URLs like nothing. It almost felt like a superpower. I can't wait to keep practicing this skill so I can work with any data I want, not just datasets that others have already curated.
