How Does the K-Nearest Neighbors (KNN) Algorithm Work in Python?
Today, I will describe a classification algorithm that is intuitive and fascinating in its simplicity: K-Nearest Neighbors, known as KNN.
KNN is based on a simple yet powerful concept: “Tell me who you’re with, and I’ll tell you who you are.” In practical terms, it classifies a new data point by majority vote among the k nearest data points in the training set, typically using Euclidean distance as the metric.
The Nearest Neighbors
Imagine you have a dataset that represents various animals in a zoo, with information like weight, height, and age. When a new animal is added, KNN checks which k animals are closest in terms of Euclidean distance and uses this information to classify it. It’s a simple but highly effective approach.
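To make the idea concrete, here is a minimal from-scratch sketch of this voting logic (the zoo measurements below are invented purely for illustration):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical zoo data: [weight (kg), height (cm), age (years)]
X_train = np.array([[4.0, 30, 2], [5.5, 35, 3], [300, 160, 10], [320, 170, 12]])
y_train = np.array(["cat", "cat", "lion", "lion"])

print(knn_predict(X_train, y_train, np.array([310, 165, 11])))  # lion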
Euclidean Distance
Euclidean distance represents the “straight line” between two points in Euclidean space. Mathematically, the distance between two points P(x₁, y₁) and Q(x₂, y₂) is calculated as:
d(P, Q) = √((x₂ − x₁)² + (y₂ − y₁)²)
This method easily extends to multi-dimensional spaces, making it suitable for datasets with many features.
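In code, the same formula generalizes naturally to any number of dimensions; here is a quick sketch with NumPy (the two points are arbitrary examples):

import numpy as np

p = np.array([1, 2, 3])
q = np.array([4, 6, 8])

# Square root of the sum of squared coordinate differences
distance = np.sqrt(np.sum((q - p) ** 2))
print(distance)  # 7.07... (equivalently: np.linalg.norm(q - p))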
Implementation in Python
Let’s see how to put KNN into practice using scikit-learn, a Python library that simplifies the implementation of machine learning algorithms.
# Import the necessary libraries
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Create a small example dataset
X = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]])
y = np.array([0, 0, 0, 1, 1])

# Initialize the KNN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Predict the class of a new entry
new_entry = np.array([[5, 5]])
prediction = knn.predict(new_entry)
print("Predicted class:", prediction[0])  # Output: 0 or 1, depending on the nearest neighbors
In just a few lines of code, we created, trained, and used a KNN classifier. This example illustrates how simple it is to implement this algorithm.
Advantages and Limitations of KNN
After exploring how KNN works, it’s important to examine its advantages and limitations.
Advantages of KNN
– Simplicity: Easy to implement and understand.
– No Preliminary Assumptions: Being a non-parametric algorithm, it doesn’t make assumptions about the data distribution.
– Multi-class Adaptability: Easily handles classification problems with multiple classes.
– Versatility: Can be used for both classification and regression problems (see the sketch after this list).
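As a quick illustration of the last point, scikit-learn also offers KNeighborsRegressor, where the prediction is the average of the targets of the k nearest neighbors (the data below is a toy example):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D regression data, purely illustrative
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])

# The prediction is the mean target of the k nearest neighbors
reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X, y)
print(reg.predict([[3.5]]))  # mean of the targets of the two closest points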
Limitations of KNN
– Computational Efficiency: It can be slow on large datasets, since prediction requires calculating the distance from the query point to every point in the training set.
– Sensitivity to Outliers: Outliers can negatively affect predictions.
– Curse of Dimensionality: In high-dimensional spaces, Euclidean distance can become less meaningful.
– Choosing the Value of k: Finding the optimal value of k often requires experimentation and validation (one common approach is sketched below).
In summary, KNN is a powerful algorithm, but it is essential to understand when and how to use it effectively.
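For choosing k, a common approach is to evaluate several candidate values with cross-validation; here is a minimal sketch using scikit-learn's GridSearchCV on synthetic data (the grid of odd values is just an example):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic dataset, purely for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# Evaluate odd values of k from 1 to 19 with 5-fold cross-validation
param_grid = {"n_neighbors": list(range(1, 20, 2))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)
print("Best k:", grid.best_params_["n_neighbors"])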
Practical Example: Flower Species Classification with the Iris Dataset
To further illustrate the use of KNN, let’s consider the famous Iris dataset, which contains 150 flower samples divided into three species: Setosa, Versicolor, and Virginica. Each sample includes four features: sepal length and width, and petal length and width.
KNN Implementation in Python
Here’s how to implement KNN to classify flower species:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Create the KNN model with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy * 100:.2f}%")
– Dataset: Using the Iris dataset with `load_iris()`.
– Data Preparation: Splitting the data into a training set and test set with `train_test_split()`.
– Model Creation: Initializing the KNN classifier with `n_neighbors=3`.
– Model Training: Training with `fit()`.
– Prediction and Evaluation: Predicting the classes and calculating accuracy with `accuracy_score()`.
This example demonstrates how KNN can be used to solve real-world classification problems.
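One refinement worth noting: since KNN relies on distances, features on very different scales can dominate the result. A common variant standardizes the features first, for example with a Pipeline; the sketch below reuses X_train, X_test, y_train, and y_test from the example above (this step is optional for Iris, whose features share similar scales):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Standardize features before computing distances, then apply KNN
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)
print(f"Accuracy with scaling: {model.score(X_test, y_test) * 100:.2f}%")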
I am passionate about technology and the many nuances of the IT world. Since my early university years, I have participated in significant Internet-related projects. Over the years, I have been involved in the startup, development, and management of several companies. In the early stages of my career, I worked as a consultant in the Italian IT sector, actively participating in national and international projects for companies such as Ericsson, Telecom, Tin.it, Accenture, Tiscali, and CNR. Since 2010, I have been involved in startups through one of my companies, Techintouch S.r.l. Thanks to the collaboration with Digital Magics SpA, of which I am a partner in Campania, I support and accelerate local businesses.
Currently, I hold the positions of:
– CTO at MareGroup
– CTO at Innoida
– Co-CEO at Techintouch S.r.l.
– Board member at StepFund GP SA
A manager and entrepreneur since 2000, I have been:
– CEO and founder of Eclettica S.r.l., a company specializing in software development and System Integration
– Partner for Campania at Digital Magics S.p.A.
– CTO and co-founder of Nexsoft S.p.A., a company specializing in IT service consulting and System Integration solution development
– CTO of ITsys S.r.l., a company specializing in IT system management, where I actively participated in the startup phase
I have always been a dreamer, curious about new things, and in search of “new worlds to explore.”