Strava and Ride with GPS APIs

Strava and Ride with GPS are great, but I want more. It's time to use their APIs to customize my cycling data and finally learn how much faster I've gotten on the bike over the last year.


Introduction

I started biking consistently in late 2020, and as a data-minded individual, I immediately wanted a way to track my rides and therefore my progress on the bike. I turned to Strava and Ride with GPS, two activity trackers that let you upload a ride recorded on your phone and view your distance, speed, total elevation gain, etc. I’ve used both religiously since and have now logged over 200 activities and more than 4,000 miles.

Both of these services use something called 'segments' -- portions of road or trail (typically uphill drags) created by members so athletes can compare times -- and these segments are one of the best ways amateurs can track their fitness over time. Ride with GPS does a great job of giving a broad overview of my fitness level over the years but lacks the granularity needed to really track segments effectively. Strava is much better in this regard, but it only produces a nice little graph of my forty most recent efforts on a segment. That too is limiting: I have made far more than forty attempts on segments near my house, and Strava currently doesn’t have a good way of monitoring this.

Luckily, I’ve been learning Python! By accessing both services' APIs I can get at my own data and format it exactly how I want to finally answer my own cycling questions -- namely, how much faster have I gotten over the last year?

Accessing Strava's API

I started with Strava. After submitting an API application, I gained access to my Client ID, Client Secret, Access Token, and Refresh Token. We’ll need all four of these to make a request to the Strava API, but the two most important ones for now are the Client ID and Client Secret. In our code, we’ll use these two identifiers to request our Access Token and Refresh Token, since Access Tokens expire (essentially, we need a new one every day).

With the Access Token and the URL we'll send the request to, I can now pull my data for every ride I’ve uploaded to Strava and save it locally. For my analysis, though, I only need to keep a few attributes, so I can pare the data down. So after running this code:

import requests
import pandas as pd

# Authentication URL
auth_url = "https://www.strava.com/oauth/token"

# URL used to access all the data.
activities_url = "https://www.strava.com/api/v3/athlete/activities"

payload = {
    'client_id': 'INSERT CLIENT ID',
    'client_secret': 'INSERT CLIENT SECRET',
    'refresh_token': 'INSERT REFRESH TOKEN',
    'grant_type': 'refresh_token',
    'f': 'json'
}

# Requesting a fresh access token.
print("Requesting Token...\n")
res = requests.post(auth_url, data=payload)
access_token = res.json()['access_token']
print("Access Token = {}\n".format(access_token))

# Using the acquired access token for the API request. Strava returns at most
# 200 activities per page, so keep requesting pages until one comes back empty.
print("Requesting pages (200 activities per full page)...")
activities_df = pd.DataFrame()
page = 1
page_non_empty = True
while page_non_empty:
    header = {'Authorization': 'Bearer ' + access_token}
    param = {'per_page': 200, 'page': page}
    my_activities = requests.get(activities_url, headers=header, params=param).json()
    activities_df = pd.concat([activities_df, pd.DataFrame(my_activities)], ignore_index=True)
    page_non_empty = bool(my_activities)
    print(page)
    page = page + 1

print("\n", len(activities_df), "activities downloaded")

# Only keeping columns I actually need for my analysis.
cols = ['name', 'upload_id', 'type', 'start_date_local', 'distance', 'moving_time',
        'elapsed_time', 'total_elevation_gain', 'average_speed', 'max_speed',
        'achievement_count', 'average_cadence', 'average_watts', 'average_heartrate']

activities_df = activities_df[cols]

print(activities_df)

# Saving data to a .csv file for future analysis.
activities_df.to_csv('strava_activities_data.csv')

I've collected my Strava ride data and saved it to a .csv.

Accessing Ride with GPS's API

Accessing Ride with GPS's API is similar to Strava's; however, it doesn’t use a Refresh Token.

import requests
import pandas as pd

# URL used to access all trip data.
activities_url = "https://ridewithgps.com/users/3350753/trips.json"

access_token = 'INSERT TOKEN HERE'

# API request
header = {'Authorization': 'Bearer ' + access_token}
param = {'per_page': 200, 'page': 1}
response = requests.get(activities_url, headers=header, params=param)
activities_data = response.json()
activities_df = pd.DataFrame(activities_data)

# Only keeping the columns that are actually useful to my analysis.
cols = ['user_id', 'id', 'created_at', 'distance', 'duration', 'elevation_gain',
        'avg_cad', 'max_cad', 'avg_speed', 'max_speed', 'moving_time',
        'avg_watts', 'max_watts', 'calories', 'locality']
activities_df = activities_df[cols]

pd.set_option('display.max_columns', None)
print(activities_df)

# Saving data to a .csv file for future analysis.
activities_df.to_csv('ridewithgps_activity_data.csv')

I now have all my ride data from Strava and Ride with GPS and can begin the analysis.

Data Cleaning

While looking through the data I collected from Strava I noticed the units of measurement were largely foreign to me as a cyclist -- distance was in meters, moving time in seconds, and average speed in meters/second. I’m much more comfortable with miles, hours and minutes, and miles/hour so my first goal was to alter the attribute names (distance --> distance_m) and make new columns in the Pandas data frame with converted units of measurement (meters --> miles). I also altered the format of the date of activity.

# Renaming columns with their units of measurement.
strava_activities_df = strava_activities_df.rename(columns={'distance': 'distance_m',
                                                            'moving_time': 'moving_time_sec',
                                                            'elapsed_time': 'elapsed_time_sec',
                                                            'total_elevation_gain': 'total_elevation_gain_m',
                                                            'average_speed': 'average_speed_m/sec',
                                                            'max_speed': 'max_speed_m/sec'})

# Changing start_date_local to datetime format, then making a new column with only year-month-day.
strava_activities_df['start_date_local'] = pd.to_datetime(strava_activities_df['start_date_local'])
strava_activities_df['activity_date'] = strava_activities_df['start_date_local'].dt.strftime('%Y-%m-%d')
strava_activities_df['activity_date'] = pd.to_datetime(strava_activities_df['activity_date'])

# Distance column is in meters. Adding a column in miles.
strava_activities_df['distance_mi'] = strava_activities_df['distance_m'].apply(lambda x: x * 0.000621371)

# Moving time column is in seconds. Adding a column in hours:minutes:seconds.
strava_activities_df['datetime'] = pd.to_datetime(strava_activities_df['moving_time_sec'], unit='s')
strava_activities_df['moving_time_hr:min:sec'] = strava_activities_df['datetime'].dt.strftime('%H:%M:%S')
del strava_activities_df['datetime']

# Elapsed time is in seconds. Adding a column in hours:minutes:seconds.
strava_activities_df['datetime'] = pd.to_datetime(strava_activities_df['elapsed_time_sec'], unit='s')
strava_activities_df['elapsed_time_hr:min:sec'] = strava_activities_df['datetime'].dt.strftime('%H:%M:%S')
del strava_activities_df['datetime']

# Total elevation gain is in meters. Adding a column in feet.
strava_activities_df['total_elevation_gain_ft'] = strava_activities_df['total_elevation_gain_m'].apply(lambda x: x * 3.28084)

# Average speed is in m/sec. Adding a column in mph.
strava_activities_df['average_speed_mph'] = strava_activities_df['average_speed_m/sec'].apply(lambda x: x * 2.236936)

# Max speed is in m/sec. Adding a column in mph.
strava_activities_df['max_speed_mph'] = strava_activities_df['max_speed_m/sec'].apply(lambda x: x * 2.236936)

Analysis of All Activities

With the data altered appropriately it was easy to make some quick graphs to see how my average speed has changed over time.
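As a reference for how such a graph might be produced, here's a minimal matplotlib sketch, assuming the cleaned strava_activities_df from the previous step is still in memory (filtering to 'Ride'-type activities is my own assumption):

import matplotlib.pyplot as plt

# Scatter of average speed over time, using the columns created in the cleaning step.
rides = strava_activities_df[strava_activities_df['type'] == 'Ride']
plt.scatter(rides['activity_date'], rides['average_speed_mph'], s=12)
plt.xlabel('Date')
plt.ylabel('Average speed (mph)')
plt.title('Average speed by date')
plt.show()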

[Figure: Strava - Average speed by date]

The graph doesn’t necessarily show much progress in getting faster until June 2022, which coincidentally is right when I moved from Portland, OR to Newport Beach, CA. I can create a new graph that shows this quite explicitly.
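To quantify the split, here's a rough sketch; the June 2022 cutoff is taken from the move described above, and the data frame is the one cleaned earlier:

import pandas as pd

# Comparing mean average speed before and after the move.
cutoff = pd.to_datetime('2022-06-01')
before = strava_activities_df[strava_activities_df['activity_date'] < cutoff]
after = strava_activities_df[strava_activities_df['activity_date'] >= cutoff]
print('Before the move:', round(before['average_speed_mph'].mean(), 2), 'mph')
print('After the move: ', round(after['average_speed_mph'].mean(), 2), 'mph')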

[Figure: Strava - Average speed before and after moving to Newport Beach]

I’d still love to break this data down even further by location. I’ve cycled mainly in four places -- Santa Cruz, Portland, online with Zwift, and Newport Beach. The main reason I think this graph doesn’t show much progress early in my cycling journey is Zwift, a virtual training application. Because it’s virtual, the simulated wind resistance is much different than in real life, allowing you to maintain a higher average speed with less effort. I have a hunch that my time using Zwift is clouding my real progress.

However, we’ll need to switch to my data collected with Ride with GPS, as it tracks these localities while Strava does not. Before beginning, though, I’ll need to go through the same cleaning process as before: change the attribute names and add new columns with the converted units of measurement.
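That cleaning pass isn't shown in full here, but a minimal sketch might look like the following, assuming Ride with GPS reports distance and elevation gain in meters (mirroring the Strava conversions above; rwgps_df is the data frame loaded from the earlier .csv):

import pandas as pd

# Renaming columns with their units, then adding converted columns.
rwgps_df = rwgps_df.rename(columns={'distance': 'distance_m',
                                    'duration': 'duration_sec',
                                    'elevation_gain': 'elevation_gain_m'})
rwgps_df['distance_mi'] = rwgps_df['distance_m'] * 0.000621371
rwgps_df['elevation_gain_ft'] = rwgps_df['elevation_gain_m'] * 3.28084
rwgps_df['created_at'] = pd.to_datetime(rwgps_df['created_at'])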

With that out of the way, I can begin mapping each ride's locality to a broader region. For example, Zwift has several virtual worlds you can ride in, like London and New York, and I’ll need to map these localities to one broader location -- Zwift. The same goes for all rides completed in Santa Cruz, Portland, and Newport Beach.

# Mapping each locality into broader location buckets.
location_mappings = {
    'Laguna Beach': 'Newport Beach',
    'Newport Beach': 'Newport Beach',
    'Newport Center': 'Newport Beach',
    'Orange County': 'Newport Beach',
    'Irvine': 'Newport Beach',
    'Tustin': 'Newport Beach',
    'Portland': 'Portland',
    'Live Oak': 'Santa Cruz',
    'Opal Cliffs/Pleasure Point': 'Santa Cruz',
    'Harrogate': 'Zwift',
    'Innsbruck': 'Zwift',
    'London': 'Zwift',
    'Murivai': 'Zwift',
    'Nea': 'Zwift',
    'noname': 'Zwift',
    'NYC': 'Zwift',
    'Thio': 'Zwift'
}

# Adding a column with the broader location buckets; unmapped localities default to Zwift.
rwgps_df['broad_location'] = rwgps_df['locality'].map(location_mappings).fillna('Zwift')

I can now make a graph showing my rides grouped by broad location.
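As a sketch, one way to draw that graph with matplotlib, assuming the rwgps_df built above (judging by the values in the table further down, avg_speed appears to already be in mph):

import matplotlib.pyplot as plt

# One scatter series per broad location bucket.
for location, group in rwgps_df.groupby('broad_location'):
    plt.scatter(group['created_at'], group['avg_speed'], s=12, label=location)
plt.xlabel('Date')
plt.ylabel('Average speed (mph)')
plt.legend()
plt.show()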

[Figure: Strava - Average speed grouped by broad location]

Like I thought, Santa Cruz shows decent progress over a limited number of rides, Zwift is all over the place, Portland is relatively stable, and Newport Beach shows a lot of progress.

I also created a new data frame to display my average speed and average elevation gain for each location.
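A minimal sketch of how that summary might be computed, assuming the avg_speed column and the elevation_gain_ft column from the cleaning pass above:

# Mean speed and mean climbing per broad location bucket.
location_summary = (rwgps_df.groupby('broad_location')
                            .agg(average_speed=('avg_speed', 'mean'),
                                 average_elevation_gain=('elevation_gain_ft', 'mean'))
                            .reset_index())
print(location_summary)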

Location        Average Speed (mph)    Average Elevation Gain (ft)
Newport Beach   14.179539              1,451.867834
Portland        13.843482              859.374791
Zwift           17.009829              964.918962
Santa Cruz      14.140310              912.804895


It clearly shows how much faster you can go on Zwift compared to real life, but I also think it does a great job demonstrating my progress on the bike. I’m only averaging about 0.3 mph faster in Newport Beach than in Portland, but with an average of roughly 590 ft more climbing per ride.

Strava Segment Effort Analysis

I’ll actually need to do another API request, as the URL I'm requesting my data from is different: https://www.strava.com/api/v3/segment_efforts. It’s the exact same process as before, just with different parameters; this time I’ll need to include the Segment ID. I’ll similarly keep only certain attributes from the collected data, rename them, and make new columns with the converted units of measurement.
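A sketch of that request, reusing the access token acquired earlier (the segment ID is a placeholder, and per_page is my assumption):

import requests
import pandas as pd

# Requesting all of my efforts on one segment.
segment_efforts_url = "https://www.strava.com/api/v3/segment_efforts"
header = {'Authorization': 'Bearer ' + access_token}
param = {'segment_id': 'INSERT SEGMENT ID', 'per_page': 200}
efforts = requests.get(segment_efforts_url, headers=header, params=param).json()
efforts_df = pd.json_normalize(efforts)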

I’ve chosen six segments to visualize: five climbs and one flat. For each segment I made a graph plotting my average speed against the date. I also included a trend line to easily show any progress, and a highlighted plot point to display the fastest time recorded.
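Here's a rough sketch of one such plot. efforts_df is the hypothetical data frame from the request above; computing average speed from each effort's distance (meters) and elapsed_time (seconds) is my own shortcut, and numpy's polyfit supplies the trend line:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Average speed per effort in mph (m/s * 2.236936).
dates = pd.to_datetime(efforts_df['start_date_local'])
speeds = efforts_df['distance'] / efforts_df['elapsed_time'] * 2.236936

# Fit a linear trend line against the date expressed as an ordinal number.
x = dates.map(pd.Timestamp.toordinal)
slope, intercept = np.polyfit(x, speeds, 1)

plt.scatter(dates, speeds, s=12)
plt.plot(dates, slope * x + intercept, color='red', label='Trend')

# Highlight the fastest effort.
best = speeds.idxmax()
plt.scatter(dates[best], speeds[best], color='gold', zorder=3, label='Fastest')
plt.xlabel('Date')
plt.ylabel('Average speed (mph)')
plt.legend()
plt.show()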

[Figure: Strava - Backside Newport Coast]
[Figure: Strava - Culver Big Ring]
[Figure: Strava - Newport Coast Road]
[Figure: Strava - Ridge Park Road]
[Figure: Strava - San Joaquin Road]
[Figure: Strava - Back Bay]

Wow, the progress is very real. I’ve gotten faster on literally every segment, and by quite a sizeable margin. The segment Backside Newport Coast from Turtle Ridge is the climb I ride most often and I can do it about two minutes faster now than I could a year earlier. That’s 20% faster!

Conclusion

I was able to use Python to access the Strava and Ride with GPS APIs to get the data I needed, then used Pandas to manipulate that data into insights about my cycling performance. I found that I've slowly but surely raised my average speed on the bike, despite more elevation gain per ride. And on certain segments, I determined I'm over 20% faster now than this time last year.

More than anything, I’m blown away by the possibilities APIs present. Strava and Ride with GPS are great services, but they still lack all the functionality someone like me desires. By learning how to access their APIs (and therefore essentially any other API) I now have the ability to customize the data and build all the functionality I’ll ever need.

Learning Take-Aways

I love Python. So far, most of my data projects have used SQL or Excel, and while they each have their own benefits, I think I enjoy Python the most. Its ability to access data, clean it, and then analyze it all in one place makes it incredibly versatile. I'm beyond excited to keep utilizing Python, and I can't wait to see what data I'll be analyzing next.

That being said, Python is challenging, though not overly so when getting some help from online. There are a ton of great resources for beginners like me, and it seems that every question I had while doing this project, other people had too. It was honestly pretty enjoyable seeing what people were working on while asking the same questions as me.

APIs can unlock a whole new world of data. Before beginning this project I knew APIs would be really important, and after using them here I'm so excited about the possibilities they hold. All my other projects were limited to datasets that had already been gathered and curated by others, but now that I can collect data on my own with the help of APIs, the world is my oyster.
