Updated: 06/04/2025

A few days ago, I read this cool article posted on HackerNews that investigated the truth behind a French saying: “The closer to the train station, the worse the kebab.”

As a Frenchman living in Toulouse, I felt duty-bound to defend my country’s kebab sovereignty against this Swedish incursion 🥖🍷🇫🇷

I mean, what’s next? Tacos??

NEVVERRRR

In all seriousness though, I loved the mentality, and I was initially curious about how things are in Toulouse. But if I am investing the effort, then why not investigate other French cities as well?

I also felt, as the author stated, that the analysis fell short (spoiler alert: it didn’t 😃).

Given that I had never worked with GIS data before, this seemed like a good excuse to start.

My code is available here. Knock yourself out.

Revisiting the data collection steps

What is the data being used?

Position and Reviews for Kebab restaurants: this is acquired using Google Places APIs. The search for the Kebab restaurants is done by searching for the term “kebab”, within a circle centered on the train station. The search is repeated for each train station.
Location of the train station: This is discussed in the following subsection. Long story short, I acquired it from the SNCF database.
The distance between the kebab restaurant and the train station: This is discussed more below. I used the straight-line distance.

Correcting an issue: Location of the train station

To get the list of stations, the author used OpenStreetMap (OSM), selecting the tag railway=train_station_entrance.

However, such a tag does not exist (see the list of OSM map features).

The right move was to use the railway=station tag (which will yield both cargo and passenger stations), and then filter on public_transport==station (to get the passenger only stations).
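As a rough sketch of that tag combination, here is how an equivalent Overpass query could be put together. The helper name `build_station_query` and the use of Overpass QL are my own assumptions for illustration (the original post does not say which tool ran the query), and the function only builds the query text; it never contacts a server.

```python
def build_station_query(city_name: str) -> str:
    """Build an Overpass QL query for passenger stations in a named area.

    Hypothetical helper: it combines railway=station with
    public_transport=station, the two filters described above.
    """
    return (
        "[out:json][timeout:60];\n"
        f'area["name"="{city_name}"]->.city;\n'
        'nwr["railway"="station"]["public_transport"="station"](area.city);\n'
        "out center;"
    )


# Print the query so it can be pasted into an Overpass client
print(build_station_query("Toulouse"))
```

Keep in mind that, as described further down, this kind of filter still returned subway and regional express stations for me.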

So in essence, the main objective - the “Gare” - is not there (intentionally at least).

Besides, the original saying states “Gare”, the railway station: not a subway station, and not an RER (Regional Express Network) station.

If we limit the analysis to that, I can end up with a manageable problem.

Let’s get the train stations only then

Well, easy to say. I spent many fruitless hours on this.

Get it from OSM: I tried every combination of filters / tags from the map features list, but the results consistently kept returning subway and regional express stations! Frustrating!
Get it from the Google Maps Places API: I tried searching for “Gare” and “Gare, Paris, France”…only one result (definitely not a Gare) came back. Then I tried “train station”…many results came back! BUTTTT, some subway stations were still in there.
1. It seems to be confusing the regional train (that connects the greater Paris area) with the traditional train (that goes outside Paris). The name “Gare” is only given to stations of the latter category.
2. Looking back, the results were actually decent, but by that point I had become too frustrated with Google’s API documentation to consider using it. No compromises.
  1. Just make sure to have a suitable location bias…things will get wild otherwise.
Finally, I checked online. SNCF, the operator of the French railways, has a list of all the Gares in France, ready to export. However, after careful investigation, I found the “Gare de Lalande-Église” in Toulouse listed, which has been permanently closed since 2016 (its name is now “Ancienne Gare de Lalande-Église”). There is no filter for such closed stations, and I don’t know how many of them (and other issues) exist…but I was ready to give up this fight by now.

This makes the problem more tractable. For Paris, for example, instead of 1181 stations (in the author’s original post), or 247 after applying the proper OSM tags and filters mentioned earlier (which still didn’t work to my liking), there are now only 42.

This should also improve the analysis…

The dense-city problem

Given the large number of stations the author took into account, I believed that this would lead to distorted results. The problem with dense cities is that train and metro stations are all very close to each other. The point furthest from one station is probably the closest to another station.

So when analyzing “the distance to the nearest station entrance for each establishment”, one can run into many unwelcome overlaps.

By committing to analyzing the “Gare” only, the problem of density resolves itself. The stations, even in a packed city like Paris, are sufficiently far away from each other. Sure, you will argue that in Paris, “Gare de l’Est” and “Gare du Nord” are almost across the street from each other. But that is not that bad. Other than this cluster, this case doesn’t repeat itself.

Adjusted rating

It is problematic to compare ratings to each other, given that the number of reviews is not the same.

So, I am going to correct this. I can create a new adjusted rating that will take the number and distribution of reviews into account…

Life will be amazing, justice to the poor, viva la France, yaaayyyy…..

Except…the API returns 5 reviews at most. That took a couple of hours of debugging…

So, unfortunately, I will have to rely on the final rating only.
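For the record, one common way to build such an adjusted rating, had usable review counts been available, is a Bayesian average that pulls restaurants with few reviews toward a prior. This is only a sketch: the function name and the `prior_mean` and `prior_weight` values are picked for illustration, and none of it is used in the analysis that follows.

```python
def adjusted_rating(rating, n_reviews, prior_mean=4.0, prior_weight=20):
    """Shrink a rating toward prior_mean; the fewer the reviews, the stronger the pull.

    prior_mean and prior_weight are illustrative assumptions, not fitted values.
    """
    return (prior_weight * prior_mean + n_reviews * rating) / (prior_weight + n_reviews)


# A perfect 5.0 from 3 reviews ends up below a 4.6 from 400 reviews
few_reviews = adjusted_rating(5.0, 3)
many_reviews = adjusted_rating(4.6, 400)
print(round(few_reviews, 2), round(many_reviews, 2))
```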

Collecting the restaurant data

After some trial and error, I found the author’s approach of just searching for the word Kebab to be sound. Thus I followed it.

Given the limited number of station points that I am using (tighter definition), this resulted in 198 restaurants for Paris, which is manageable.

Distance measurement: Simplified assumptions

Since GIS is not my strong suit, and I didn’t think that ultra accuracy / data integrity was necessary (I mean, I am using a Google API that doesn’t provide comprehensive results, with ratings of questionable quality, with no access to the individual ratings, let alone the reviews), I am going to assume that the relevant distance is the Euclidean distance (thus ditching the “walkable” distance). Why?

Large cities in France tend to be very dense and walkable. Thus, I suspect (without providing any evidence) that the Euclidean distance will correlate nicely with the walkable distance.
A problem can arise when there is a river inside the city. This can screw things up badly. But normally, if the river / canal is narrow, there are crossings in front of the station (anecdotal evidence, I need to find a better way to tackle this). If the river is wide, that is no longer the case, but then we are fine (far away is truly far away).
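To make the simplification concrete, here is a minimal sketch of the straight-line distance, assuming both points are already expressed in a projected coordinate system with meters as units. The function name and the coordinates are made up for the example.

```python
import math


def straight_line_distance(station_xy, restaurant_xy):
    """Euclidean distance in meters between two (x, y) points in a projected CRS."""
    dx = restaurant_xy[0] - station_xy[0]
    dy = restaurant_xy[1] - station_xy[1]
    return math.hypot(dx, dy)


# Made-up projected coordinates: 300 m east and 400 m north of the station
station = (0.0, 0.0)
restaurant = (300.0, 400.0)
print(straight_line_distance(station, restaurant))  # 500.0
```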

Notes on the GIS data

From the original post:

all data was projected to EPSG:32631 (UTM zone 31N).

What the hell does any of that mean?

After some googling, apparently longitude and latitude are not great for calculating the distance between geographical points. To correct for that, you need to project to the Universal Transverse Mercator (UTM) coordinate system, which is a horizontal position representation that ignores altitude and treats the earth’s surface as a perfect ellipsoid. It divides the world into 60 zones, each representing a plane with its own coordinates based on the World Geodetic System (WGS 84).

You can find the relevant zone on many websites, like this one. Both Paris and Toulouse are in the UTM 31N zone.

These zones have ISO-like codes called EPSG, which are better for computers (I guess). Once you have the UTM zone, you can get the EPSG code from here.

Paris and Toulouse are in the northern hemisphere (N), in zone 31, thus UTM 31N, which has the EPSG code 32631.
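If you would rather compute the zone than look it up, the standard 6-degree grid makes it a one-liner. The sketch below assumes WGS 84 coordinates and ignores the special-case zones around Norway and Svalbard; the function names are mine.

```python
import math


def utm_zone(longitude: float) -> int:
    """UTM zone number (1 to 60) from the standard 6-degree longitude grid."""
    return int(math.floor((longitude + 180.0) / 6.0)) % 60 + 1


def utm_epsg(latitude: float, longitude: float) -> int:
    """EPSG code of the WGS 84 / UTM zone: 326xx in the north, 327xx in the south."""
    base = 32600 if latitude >= 0 else 32700
    return base + utm_zone(longitude)


print(utm_epsg(48.8566, 2.3522))  # Paris -> 32631
print(utm_epsg(43.6047, 1.4442))  # Toulouse -> 32631
```

By the same rule, cities west of 0° such as Nantes, Bordeaux and Rennes land in zone 30N, and Strasbourg (around 7.75°E) lands in zone 32N. The code at the end keeps 32631 everywhere, which I assume is good enough for distances under a kilometer.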

How the hell did he get the Paris Polygon?

Google Maps: Rest API vs Python SDK

What a weird question! Both are the same!

Yet, for the life of me, repeating the author’s original step with the Python SDK never worked for me.

Several hours later…no resolution.

By accident, I realized that using cURL with the URL directly works pretty well!

So, to move on with my life, I wrote a wrapper around it instead.

Results

To recap, I want to assess the relationship between the restaurant rating and the inverse of the distance between that restaurant and the train station.

I decided initially on a distance cut-off of 1000 m (who wants to go that far anyway for a kebab?). While calculating the coefficients, I also trained a random forest classifier (with the distance as the input, and the ratings thresholded by their median as the output), in order to keep an eye on non-linear relationships.

The main non-linear source I had in mind is a thresholding effect: that the relationship between the distance and the rating changes after a certain distance cut-off. Thus, the more positive the coefficient, the stronger the support for the initial claim.

With the distance cut-off of 1000 m, the coefficients in general seemed low (with a couple of exceptions), while the F1 score of the classifier was high enough (you can check the logs here). This triggered me to try 500 meters, which indeed resulted in a different picture (albeit at the cost of reducing the number of data points).

City	Nb of data points at 1000 m	Coeff at 1000 m	p-value at 1000 m	Nb of data points at 500 m	Coeff at 500 m	p-value at 500 m
Paris	350	-0.018	0.7406	109	0.119	0.4761
Toulouse	37	-0.087	0.6089	6	-0.474	0.5371
Lyon	38	-0.128	0.4436	14	-0.104	0.2054
Marseille	30	-0.056	0.7676	23	0.031	0.6218
Lille	20	-0.102	0.6677	1	-	-
Bordeaux	2	1.0	1.0	0	-	-
Nantes	4	0.221	0.7793	3	0.314	0.6674
Strasbourg	11	0.606	0.048	4	0.744	0.5865
Rennes	18	0.023	0.928	2	1.0	1.0
Montpellier	20	0.08	0.7384	11	0.352	0.3628

Now, looking at these results, while the coefficients are interesting in the 500 m case, the statistics are clearly weak all over the place, except for Strasbourg!

Only Strasbourg at 1000 m (and 750 m, not reported here) exhibits a strong pattern that indeed, the closer to the train station, the worse the kebab.

Thus, the logical conclusions from this utter waste of time are:

France, in general, doesn’t have a kebab issue near the train station. More scientifically, there is not enough evidence to reject the “there is no linear relationship” hypothesis…but it just doesn’t look promising.
Strasbourg is German 😅

So, indeed, there is not sufficient evidence for a linear relationship between the distance to the station and the quality of the kebab…

Technically speaking, we addressed the raised issue, given the available data…

But what about a non-linear relationship?

Exploring the non-linearity

(this is the mental-prostitution section, feel free to skip it if you have anything to live for)

The p-value reported above is an extremely limited tool here, in the sense that it comes from a Pearson correlation, which only tests for a linear relationship.

I am a big believer in predictive power analysis. Basically, build a good machine learning predictor model, and infer some aspects of the data from the model performance / characteristics.

Now, to do this properly, one should try different models and hyper-parameters, but this is overkill for this problem, so I will just use Random Forests with the default settings from scikit-learn.

I am also not a big fan of studying regression models (continuous is overrated), so I will make a modification here. I will threshold the ratings based on the median of the restaurant ratings in that city. Thus, the new score is 1 (good kebab) if the rating is at or above the median, 0 (bad kebab) otherwise. This gives me a nice balanced dataset.
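As a tiny illustration of that labeling step (a sketch with made-up ratings; the `>=` mirrors the comparison used in the code at the end, so a rating exactly at the median counts as good):

```python
from statistics import median


def threshold_by_median(ratings):
    """Label each rating 1 (good kebab) if it is at or above the median, else 0."""
    cutoff = median(ratings)
    return [1 if rating >= cutoff else 0 for rating in ratings]


# Made-up ratings for six restaurants; the median here is 4.25
ratings = [3.8, 4.1, 4.4, 4.6, 4.9, 3.2]
print(threshold_by_median(ratings))  # [0, 0, 1, 1, 1, 0]
```

When many restaurants share the median rating, the split is no longer an even 50-50, which is why the class balance is worth checking.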

To get some statistical sense, I report the results on a 5-fold cross validation.

Interestingly, in the majority of cases, the model performs better (or far better) than random!

To be extra careful (since the balance can be 45-55% sometimes), I am going to report the F1-score (the things that we chose to care about….).

But since we are at it, why focus only on the distance? We can study the angle from the station as well. Why? My hunch was that one side of the station is not the same as the other side.

For the sake of brevity, I will report the average F1 scores for the 1000 meter cut-off only.
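For the curious, the angle can be computed with a plain `atan2` on the projected coordinates. A small sketch, assuming x is the easting and y the northing in meters, so that 0 points east and positive angles turn counter-clockwise (the function name is made up):

```python
import math


def angle_from_station(station_xy, restaurant_xy):
    """Angle in radians of the restaurant as seen from the station, in [-pi, pi]."""
    dx = restaurant_xy[0] - station_xy[0]
    dy = restaurant_xy[1] - station_xy[1]
    return math.atan2(dy, dx)


# A restaurant 100 m due north of the station sits at +pi/2
print(angle_from_station((0.0, 0.0), (0.0, 100.0)))
```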

City	F1 score: distance only	F1 score: distance + angle
Paris	0.735	0.773
Toulouse	0.438	0.453
Lyon	0.667	0.48
Marseille	0.579	0.603
Lille	0.393	0.213
Bordeaux	-	-
Nantes	-	-
Strasbourg	0.693	0.867
Rennes	0.133	0.593
Montpellier	0.333	0.493

Except in Lille and Lyon, having the direction of the restaurant clearly provides significant value.

We can now claim there is a pattern in Paris, Marseille, Strasbourg and Rennes between the rating of the kebab restaurant and its distance / distance & angle from the station.

But what does this look like? Let’s use our X-ray goggles, and look at the decision boundaries of the model. Let’s start with Paris, with both distance and angle being used.

Paris_decision_boundary_angle:True_1000m.png

For reference, the station sits at distance 0 and angle 0. Yellow indicates a good kebab (1), and the dark color…whatever it is…indicates a bad kebab (0).

Now this is cool! Basically there are areas / clusters of good kebabs, and clusters of bad ones. It is like looking at the rooms of a large flat, with each room being occupied by people of similar interests.

Looking at Strasbourg, this clearly explains a lot of things!

Strasbourg_decision_boundary_angle:True_1000m.png

So there are a couple of good kebabs very close to (almost inside) the train station, then bad kebabs after that, then good kebabs after the bad ones (this just reminded me of poor Kif in Futurama: the beautiful women, then the large women, then the petite women… 😃).

So probably the saying should be: the quality of kebab depends non-linearly on both the distance and angle of the restaurant from the train station….

I am sure this is going to be catchy one day in the common culture.

Last thoughts

First of all, that was fun. Many thanks to the original author (James Pae) for starting this :)

Second, there is a good indication that the French people indeed live up to their reputation, in perceiving a non-linear relationship between the distance and the kebab quality :D

Finally, while surely this is just kebab, it is really disturbing to me that we can’t have access to existing higher quality data (like the Google reviews, or the individual ratings). In parallel with the “right to repair” movement, we need a “right to know” movement. These leftover datasets are of limited use. While not everyone is expected to dig into the data, some will, to help us all make better and more informed decisions.

Code

	"""
	The core code for the kebab restaurant analysis project.

	Notes:
	- Given that I couldn't get the Google Maps python SDK to work, but the curl requests work perfectly (!!), this is a wrapper with "requests" to make the requests to the Google Maps API.
	"""

	import json
	import os
	from time import sleep

	import geopandas
	import matplotlib.pyplot as plt
	import numpy as np
	import pandas as pd
	import requests
	from dotenv import load_dotenv
	from geopandas import GeoDataFrame
	from scipy.stats import pearsonr
	from sklearn.ensemble import RandomForestClassifier
	from sklearn.model_selection import cross_validate

	# Load GMAPS_API_KEY from a local .env file
	load_dotenv()

	# Constants
	PROJECTION = 32631  # EPSG code of UTM zone 31N
	DISTANCE_THRESHOLD = 1000  # in meters
	# DISTANCE_THRESHOLD = 750
	# DISTANCE_THRESHOLD = 500

	def get_stations_from_file(city: str):
	    """
	    Load the "liste-des-gares.geojson" file as a GeoDataFrame.
	    You can download it from here:
	    https://ressources.data.sncf.com/explore/dataset/liste-des-gares/export/?location=6,46.90879,1.85167&basemap=jawg.transports
	    """
	    stations = GeoDataFrame.from_file("liste-des-gares.geojson")
	    city = city.upper()
	    # Make sure the city is in the "commune" column
	    assert city in stations["commune"].str.upper().values
	    # Filter the stations on the city
	    stations = stations[stations["commune"].str.upper().str.contains(city)]
	    # Keep the passenger stations (voyageurs == "O")
	    stations = stations[stations["voyageurs"] == "O"]
	    # Keep the stations without freight (fret == "N")
	    stations = stations[stations["fret"] == "N"]
	    print(f"Number of stations in {city}: {stations.shape[0]}")

	    return stations


	def get_data(text_query: str, lat: float, lon: float, radius: float = 500.0):
	    """
	    Run a Places text search biased towards a circle around (lat, lon).
	    If the response has "nextPageToken" in it, then it will make another request to get the next page of results.
	    """
	    request_body = {
	        "textQuery": text_query,
	        "pageSize": 20,
	        "locationBias": {
	            "circle": {
	                "center": {"latitude": lat, "longitude": lon},
	                "radius": radius,
	            }
	        },
	    }

	    # Example of the request headers, as passed to curl:
	    # -H 'Content-Type: application/json' -H "X-Goog-Api-Key: ADD-KEY-HERE" \
	    # -H "X-Goog-FieldMask: places.id,places.displayName,places.formattedAddress,places.types,places.location"
	    request_headers = {
	        "Content-Type": "application/json",
	        "X-Goog-Api-Key": os.getenv("GMAPS_API_KEY"),
	        "X-Goog-FieldMask": "places.id,places.displayName,places.formattedAddress,places.types,places.location,places.rating,places.reviews",
	    }

	    url = "https://places.googleapis.com/v1/places:searchText"

	    # Make the request
	    all_responses = []

	    response = requests.post(url, headers=request_headers, json=request_body)
	    response_data = response.json()
	    # Use .get() so that a response without a "places" key does not crash
	    all_responses += response_data.get("places", [])

	    # While there is a nextPageToken, make another request
	    while "nextPageToken" in response_data:
	        request_body["pageToken"] = response_data["nextPageToken"]
	        response = requests.post(url, headers=request_headers, json=request_body)
	        response_data = response.json()
	        all_responses += response_data.get("places", [])
	        sleep(2)
	    # Pause before the caller moves on to the next station
	    sleep(2)

	    return all_responses


	def clean_gmap_results(data: list):
	    """
	    Given the data from the API, which is a list of dictionaries, we want to return one dictionary, with the "id" of each place as the key, and the rest of the data as the value.
	    """
	    clean_data = {}
	    for item in data:
	        # Take the "id" out of the item and use it as the key
	        place_id = item.pop("id")
	        clean_data[place_id] = item
	        # Add the latitude and longitude keys
	        location = item.pop("location")
	        item["latitude"] = location["latitude"]
	        item["longitude"] = location["longitude"]
	        # Add display name
	        item["display_name"] = item.pop("displayName")["text"]
	        # Convert the types (a list) into a string (to be able to store it in the database)
	        item["types"] = ", ".join(item["types"])
	        # Keep the individual ratings of the returned reviews, then remove the reviews key
	        if "reviews" in item:
	            reviews = item.pop("reviews")
	            item["all_ratings"] = [float(review["rating"]) for review in reviews]
	        else:
	            # No reviews returned: store an empty list to keep the type consistent
	            item["all_ratings"] = []
	    return clean_data

	def get_all_restaurant_data(city: str = "Paris", fresh_data: bool = False):
	    """
	    Given the stations, we want to get the restaurant data for each station.
	    """
	    file_name = f"{city.lower()}_all_restaurant_data.json"
	    # If there is no cached file, we have to query the API anyway
	    force_fresh_data = not os.path.exists(file_name)
	    all_restaurant_data = {}
	    if fresh_data or force_fresh_data:
	        stations = get_stations_from_file(city)
	        for i, row in stations.iterrows():
	            lat, lon = row.geometry.y, row.geometry.x
	            print(row["libelle"] + ":", lat, lon)
	            data = get_data("kebab", lat, lon, radius=1000.0)
	            clean_data = clean_gmap_results(data)
	            all_restaurant_data.update(clean_data)

	        # Save the data to a file
	        with open(file_name, "w") as f:
	            json.dump(all_restaurant_data, f, indent=4)
	    else:
	        # Load the data from the file
	        with open(file_name, "r") as f:
	            all_restaurant_data = json.load(f)

	    return all_restaurant_data


	def calculate_distances(city: str):
	    """
	    Load the stations from the file as GeoDataFrame, and load the restaurant data from the file as a dictionary.
	    Convert the restaurant data to a GeoDataFrame.
	    Calculate the distance and the angle between each station and each restaurant.
	    """
	    stations = get_stations_from_file(city)
	    all_restaurant_data = get_all_restaurant_data(city=city, fresh_data=False)

	    # Convert the restaurant data to a DataFrame
	    restaurant_data = []
	    for restaurant in all_restaurant_data.values():
	        restaurant_data.append(
	            {
	                "name": restaurant.get("display_name", ""),
	                "latitude": restaurant.get("latitude", 0),
	                "longitude": restaurant.get("longitude", 0),
	                "types": restaurant.get("types", ""),
	                "rating": restaurant.get("rating", 0),
	            }
	        )
	    restaurant_data = pd.DataFrame(restaurant_data)
	    # Remove the restaurants that have a rating of 0 (no rating available)
	    restaurant_data = restaurant_data[restaurant_data["rating"] > 0]
	    # Convert the restaurant data to a GeoDataFrame (longitude / latitude in WGS 84)
	    restaurant_data = GeoDataFrame(
	        restaurant_data,
	        geometry=geopandas.points_from_xy(
	            restaurant_data.longitude, restaurant_data.latitude
	        ),
	        crs="EPSG:4326",
	    )
	    # Project both GeoDataFrames to the same CRS, so that distances are in meters
	    restaurant_data = restaurant_data.to_crs(PROJECTION)
	    stations = stations.to_crs(PROJECTION)
	    # Calculate the euclidean distance between each station and each restaurant
	    all_distances = []
	    for i, single_station in stations.iterrows():
	        for j, single_restaurant in restaurant_data.iterrows():
	            distance = single_station.geometry.distance(single_restaurant.geometry)
	            # Calculate the angle between the station and the restaurant: do the math from the station to the restaurant
	            angle = np.arctan2(
	                single_restaurant.geometry.y - single_station.geometry.y,
	                single_restaurant.geometry.x - single_station.geometry.x,
	            )
	            # Note: after the projection, the "latitude" / "longitude" fields
	            # below hold projected y / x coordinates in meters
	            all_distances.append(
	                {
	                    "station": single_station["libelle"],
	                    "restaurant": single_restaurant["name"],
	                    "distance": distance,
	                    "angle": angle,
	                    "restaurant_rating": single_restaurant["rating"],
	                    "train_station_latitude": single_station.geometry.y,
	                    "train_station_longitude": single_station.geometry.x,
	                    "restaurant_latitude": single_restaurant.geometry.y,
	                    "restaurant_longitude": single_restaurant.geometry.x,
	                }
	            )
	    all_distances = pd.DataFrame(all_distances)
	    print(
	        f"Number of restaurants X stations, before filtering: {all_distances.shape[0]}"
	    )
	    # Remove the restaurants that are further than DISTANCE_THRESHOLD from the station
	    # (.copy() so that adding a column below does not write into a slice)
	    all_distances = all_distances[
	        all_distances["distance"] <= DISTANCE_THRESHOLD
	    ].copy()
	    print(
	        f"Number of restaurants X stations, after filtering: {all_distances.shape[0]}"
	    )
	    # Get median of ratings, and create a new column for the ratings, thresholded by the median
	    median_rating = all_distances["restaurant_rating"].median()
	    print(f"Median rating: {median_rating}")
	    all_distances["rating_threshold"] = all_distances["restaurant_rating"].apply(
	        lambda x: 1 if x >= median_rating else 0
	    )

	    # Plot the ratings histogram
	    # plt.hist(all_distances["restaurant_rating"], bins=20)
	    # plt.show()
	    # # Plot the thresholded ratings histogram
	    # plt.hist(all_distances["rating_threshold"], bins=20)
	    # plt.show()
	    # # Plot rating_threshold vs distance
	    # plt.scatter(all_distances["distance"], all_distances["rating_threshold"])
	    # plt.xlabel("Distance (m)")
	    # plt.ylabel("Rating threshold")
	    # plt.show()

	    # Calculate the correlation between the inverse distance and restaurant_rating
	    correlation = np.corrcoef(
	        1 / all_distances["distance"], all_distances["restaurant_rating"]
	    )
	    # correlation = np.corrcoef(all_distances["distance"], all_distances["restaurant_rating"])
	    correlation = np.round(correlation, 3)
	    print(f"Correlation between distance and restaurant_rating: {correlation[0, 1]}")

	    # Pearson correlation test between the inverse distance and the thresholded rating
	    try:
	        stat, p = pearsonr(
	            # 1 / all_distances["distance"], all_distances["restaurant_rating"]
	            1 / all_distances["distance"],
	            all_distances["rating_threshold"],
	        )
	        print(f"Statistics={stat}, p = {np.round(p, 4)}")
	        # Interpret the results
	        alpha = 0.05
	        if p > alpha:
	            print("No significant linear correlation (fail to reject H0)")
	        else:
	            print("Significant linear correlation (reject H0)")
	    except Exception as e:
	        print(f"Data is too small to apply the statistical test: {e}")

	    return all_distances


	def predictive_power_analysis(
	    all_distances: pd.DataFrame, city: str, use_angle: bool = False
	):
	    """
	    We want to find out if there is a non-linear relationship between the distance and the rating of the restaurant.
	    We will use a non-linear model (random forest) to predict the thresholded rating of the restaurant based on the distance (and optionally the angle).
	    Apply cross_validate to get the accuracy and f1 score.
	    """
	    # Select the input features and the labels
	    if use_angle:
	        X = all_distances[["distance", "angle"]]
	    else:
	        X = all_distances[["distance"]]
	    y = all_distances["rating_threshold"]

	    # Print the share of 1s and 0s in the rating_threshold column
	    print(
	        f"Percentage of labels (ratings split based on the median rating): {y.value_counts(normalize=True).round(3)}"
	    )
	    # Evaluate the model with 5-fold cross validation
	    model = RandomForestClassifier()
	    try:
	        cross_val = cross_validate(model, X, y, cv=5, scoring=["accuracy", "f1"])
	    except Exception as e:
	        print(f"Data is too small to apply cross_validate: {e}")
	        # Without cross validation scores there is nothing to report or plot
	        return

	    print("Cross validation results:")
	    for key in cross_val:
	        if "time" in key:
	            continue
	        mean, std = np.mean(cross_val[key]), np.std(cross_val[key])
	        # Round the mean and std to 3 decimal places
	        mean, std = np.round(mean, 3), np.round(std, 3)
	        print(key, mean, std)

	    # Fit the model on all the data, to plot its decision boundary
	    model.fit(X, y)
	    plt.figure()

	    if use_angle:
	        ########################################################
	        # Plot the decision boundary - In case of 2 features (distance and angle)
	        ########################################################
	        # Create a meshgrid
	        x_min, x_max = X["distance"].min() - 1, X["distance"].max() + 1
	        y_min, y_max = X["angle"].min() - 1, X["angle"].max() + 1
	        xx, yy = np.meshgrid(
	            np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1)
	        )
	        # Predict the values
	        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
	        Z = Z.reshape(xx.shape)
	        # Plot the decision boundary
	        plt.contourf(xx, yy, Z, alpha=0.4)
	        plt.scatter(X["distance"], X["angle"], c=y, s=20, edgecolor="k")
	        plt.xlabel("Distance (m)")
	        plt.ylabel("Angle")
	        # Colorbar
	        plt.colorbar()
	    else:
	        ########################################################
	        # Plot the decision boundary - In case of 1 feature (distance)
	        ########################################################
	        # Create a grid of distances
	        x_min, x_max = X["distance"].min() - 1, X["distance"].max() + 1
	        xx = np.linspace(x_min, x_max, 1000)
	        # Predict the values
	        Z = model.predict(xx.reshape(-1, 1))
	        # Plot the decision boundary
	        plt.plot(xx, Z, color="r")
	        plt.scatter(X["distance"], y, s=20, edgecolor="k")
	        plt.xlabel("Distance (m)")
	        plt.ylabel("Rating threshold")

	    plt.title(
	        f"Decision boundary - {city} - Model F1 score: {np.round(cross_val['test_f1'].mean(), 3)} - Distance threshold: {DISTANCE_THRESHOLD}m"
	    )
	    # plt.show()
	    plt.savefig(
	        f"{city}_decision_boundary_angle:{use_angle}_{DISTANCE_THRESHOLD}m.png"
	    )
	    # Close the figure so that figures do not pile up across cities
	    plt.close()


	if __name__ == "__main__":
	    data = get_all_restaurant_data(city="Paris", fresh_data=True)
	    # print(data)
	    for city in [
	        "Paris",
	        "Toulouse",
	        "Lyon",
	        "Marseille",
	        "Lille",
	        "Bordeaux",
	        "Nantes",
	        "Strasbourg",
	        "Rennes",
	        "Montpellier",
	    ]:
	        print(f"Calculating distances for {city}...")
	        all_distances = calculate_distances(city)
	        print("=" * 50)
	        print("Let's discover non-linear relationships...")
	        print("=" * 50)
	        for use_angle in [False, True]:
	            print(f"Using angle: {use_angle}")
	            predictive_power_analysis(all_distances, city, use_angle=use_angle)
	        print("-" * 100)
	        print("\n")
	        # break
