Questions tagged [clustering]
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
129 questions
3
votes
0
answers
766
views
Rust code implementing cosine similarity
I've been trying to create a piece of code which consists of looping through each element of a list of questions, preprocess it, and then calculate the Cosine similarity with the rest of the elements (...
1
vote
1
answer
79
views
Dividing shared resources for homogeneous multithread processing
I'm trying to implement a homogeneous multithreading example that multiple threads process portion of a huge task. In order to achieve this, I thought of clustering data/resource and multiple threads ...
1
vote
1
answer
413
views
Implement 2D and 1D std::array in opencl kernel
I am asked to implement the following part of code into kernel code. Actually, I have tried but not sure about the std::array.
This is the original code for the ...
0
votes
2
answers
467
views
Multithreaded implementation of K-means clustering algorithm in Java
Hello I have written a multi-threaded implementation of the K-means clustering algorithm. The main goals are correctness and scalable performance on multi-core CPUs. I expect to code to not have race ...
1
vote
1
answer
252
views
Calculation of the Distance Matrix in the K-Means Algorithm in MATLAB
Purpose of the code :
To assign the corresponding label of the centroids to the points which are close to it. Below is a graphical (2D) example.
Variable X is a matrix, rows represent the points, ...
2
votes
1
answer
297
views
A Tiny Nearest Neighbor Classification Implementation in C#
I am practicing to implement the KNN classification tool in C#. The basic point structure is constructed by the class Point, and there are two members in ...
4
votes
1
answer
318
views
K-clustering algorithm using Kruskal MST with Disjoint Set in place to check for cycles
here below a working implementation that finds the minimal distance between k(set =4 below) clusters in a graph.
I have doubts mainly on the implementation of the ...
6
votes
0
answers
147
views
K nearest neighbours algorithm
Here is a project that I worked on for a few days in June 2020. Since the algorithm is extremely slow, I looked into methods in order to parallelize operations but did not obtain any satisfactory ...
3
votes
1
answer
279
views
Implementation of K-means
I have recently built a class that is an implementation of kMeans from scratch. I believe there is room for improvement and I would happily receive some feedback. The project can be found at: https://...
6
votes
1
answer
607
views
Schelling's model of Segregation Python implementation with Geopandas
If you don't know what is Schelling's model of segregation, you can read it here.
The Schelling model of segregation is an agent-based model that illustrates how individual tendencies regarding ...
5
votes
2
answers
5k
views
Grouping sorted coordinates based on proximity to each other
I created an algotrithm that groups a sorted list of coordinates into buckets based on their proximity (30) to one another.
Steps:
Create a new key with a list ...
1
vote
1
answer
4k
views
Clustering using k-medoids
This is the program function code for clustering using k-medoids
...
2
votes
0
answers
134
views
Machine learning, kNN and Naïve Bayes algorithm
This is the task I am working on:
In this assignment you will implement the K-Nearest Neighbour and Naïve Bayes algorithms and evaluate them on a real dataset using the stratified cross validation ...
1
vote
1
answer
781
views
Welford's online variance calculation algorithm for vectors
I'm developing a face recognizing application using the face_recognition Python library.
The faces are encoded as 128-dimension floating-point vectors. In addition to this, each named known person ...
3
votes
0
answers
557
views
Locality Sensitive Hash (similar to k-Nearest Neighbor), in Python+Numpy
I've tried implementing Locality Sensitive Hash, the algorithm that helps recommendation engines, and powers apps like Shazzam that can identify songs you heard at restaurants.
LSH is supposed to run ...
0
votes
1
answer
67
views
Analyzing distances between clusters of orders [closed]
I wrote the Python class below, which does what I want it to do, but the data structure is a mess. Was wondering if there was a better structure I could use to get the same results but with better ...
1
vote
2
answers
347
views
Top k closest pairs in a set of million 128-dimensional points [closed]
I have a set of 1 million points in 128-dimensional space. Among all trillion pairs of the points in the set, I need to get a subset of 100 million pairs whose cosine distances are less than that of ...
1
vote
1
answer
208
views
Efficiently determining maximum allowed euclidean distance between lists of colors
I was recently tasked with determining which hex RGB colors in list color_list are nearest to each hex RGB color in list target_colors, using euclidean distance as the measuring stick, and only ...
5
votes
1
answer
6k
views
k-means using numpy
This is k-means implementation using Python (numpy). I believe there is room for improvement when it comes to computing distances (given I'm using a list comprehension, maybe I could also pack it in a ...
6
votes
1
answer
4k
views
Closest Pair algorithm implementation in C++
I had been working on the implementation of closest pair algorithm in a 2-D plane.
My approach has been that of divide and conquer O(nlogn) :
...
3
votes
1
answer
106
views
Simple natural language classifier
This program estimates the likelihood for a string to belong to a certain natural language by computing the cosine similarity between an input string's and several natural languages' letter frequency, ...
6
votes
1
answer
526
views
Haskell K-means implementation
The following is a Haskell backend to a K-means visualisation:
I have omitted the API code (exists in a separate module), the relevant endpoints simply call ...
2
votes
1
answer
169
views
R function to generate predictions from ratings
I am trying to improve the run time of a program I wrote in R. Generally, what I am doing is feeding a function a data frame of values and generating a prediction off of operations on specific columns....
2
votes
0
answers
122
views
Regression on Pandas DataFrame
I am working on the following assignment and I am a bit lost:
Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common ...
4
votes
1
answer
415
views
K-Nearest Neighbors in pure Python
I want a general criticism on this code. Using external modules is not an option, I can only use what comes with CPython.
...
1
vote
1
answer
1k
views
Finding the distance between the two closest points in a 2-D plane
Here is my code:
...
0
votes
1
answer
2k
views
Determining the similarity between two documents
I've made some code that reads in text files (which hold quite large vectors of word frequencies), which in turn stores each index of a vector within an ArrayList ...
13
votes
2
answers
5k
views
Removing neighbors in a point cloud
I have written a program to optimize a point cloud in dependency of their distances to each other. The code works very well for smaller number of points. For 1700 points it takes ca. 6 minutes. But I ...
11
votes
2
answers
3k
views
K-Means image segmentation algorithm
I am a new C++ programmer and I have some experience in Python and C but I was almost completely self taught (I learned C++ with OpenClassrooms).
I would like to learn the conventions and how things ...
1
vote
0
answers
7k
views
Fuzzy c Means in Python
This is my implementation of Fuzzy c-Means in Python. In the main section of the code, I compared the time it takes with the sklearn implementation of kMeans.
...
5
votes
1
answer
336
views
Calculation of clustering metric in Python
When I try to run the following code for arrays with more than 10k elements, it takes hours and I don't know how to make it in the most efficient way.
Any ideas?
...
3
votes
0
answers
581
views
Compute distance matrix using DTW acceptable for scipy.cluster.hierarchy
I am new to both data science and python. I have a dataset of the time-dependent samples, which I want to run agglomerative hierarchical clustering on them. I have found that Dynamic Time Warping (DTW)...
2
votes
0
answers
149
views
Cluster tweet texts
I am using using the following code to cluster tweet texts. The input is a dictionary containing tweet-id and tweet text as key value pairs.
Example:
...
8
votes
1
answer
930
views
k-means implementation in python
This is the first mini-project that I'm working on python, where I implement k-means. I'm planning to parallelize it as soon as I've written a good serial version.
Code description:
Below you will ...
3
votes
1
answer
632
views
R - Outlier Detection Algorithm
I am trying to implement an algorithm for detecting outliers in R and I am pretty new to the language. The outlier algorithm is described in this paper in detail on page 10-11, but to summarize it ...
4
votes
1
answer
641
views
DBSCAN "region query" too slow; implement a tree?
My current DBSCAN in Python works...but its indexing is far too slow; its a linear scan:
...
2
votes
0
answers
486
views
A Simple K-Means Cluster Analyzer v0.2
This code is a revision of a previous post and works well.
This my first attempt at creating a robust, computationally lean K-Means Cluster analyzer. I first saw this algorithm in an intermediate ...
4
votes
1
answer
235
views
Grouping orders by similarity (cluster) of their items
My task is to write an algorithm for grouping list of orders into batches, each batch consisting of 4 orders. Orders are grouped by similarity of their items, which means the more items from Order X ...
3
votes
0
answers
2k
views
A Simple K-Means Cluster Analyzer v0.1
[NOTE] This question can be depreciated in favor of version 0.2.
This code works well.
This my first attempt at creating a robust, computationally lean K-Means Cluster analyzer. I first saw this ...
5
votes
1
answer
197
views
A Simple Cluster Generator v0.31
[NOTE] This question can be depreciated in favor of version 0.32.
This is a code revision of a previous post and works well.
The purpose of this code is to produce a universe of points, randomly ...
3
votes
1
answer
179
views
A Simple Cluster Generator v0.3
[NOTE] This question can be depreciated in favor of version 0.31.
This is a code revision of a previous post and works well.
The purpose of this code is to produce a universe of points, randomly ...
6
votes
2
answers
199
views
A Simple Cluster Generator v0.2
[NOTE] This question can be depreciated in favor of version 0.3.
This is a code revision of a previous post and works well.
Code has been reworked to be far more clear and concise, thanks to ...
1
vote
1
answer
165
views
A Simple Cluster Generator v0.1
[NOTE] This question can be depreciated in favor of version 0.2.
The purpose of this code is to produce a universe of points, randomly generated around predetermined centroids, provided as a vector ...
6
votes
1
answer
174
views
Java find minimum range
Question Description:
Given a list of companies, eg {ABC, BBC} and a large list of data looks like below:
...
7
votes
1
answer
1k
views
PANDAS nearest site algorithm
I have got CSVs full of property transactions in the UK from 1995 to 2017, separated by year such as "RS2015.csv". I have a 2nd CSV with a list of wind turbines in the UK. Both have coordinates in WGS ...
4
votes
1
answer
100
views
Getting hex colours from a image
I am trying to get the hex colours from an image. The problem I am having is that for some reason randomly the code causes high CPU usage, which times out the browser and I am not sure how to optimise ...
2
votes
0
answers
64
views
Identifying local peaks among some 2D points
The following MATLAB code takes in multiple peak coordinates and heights and eliminates lesser peaks that are within a certain distance threshold of the highest peak of the vicinity. Is there a better ...
2
votes
1
answer
303
views
Predicting outliers in network data [closed]
I'm implementing KNN algorithm to predict outliers in network data which contains the following columns: source IP address, source port number, protocol and total bytes transferred. To achieve this, I'...
7
votes
1
answer
178
views
Speeding up maximum self-similarity test for heavy tail-exponents
I am trying to reproduce results from a research paper using python. I've checked my method and it works on relatively small sample datasets. However, the code does not run for my actual dataset, ...
1
vote
1
answer
5k
views
DBSCAN c++ implementation
For work I had to implement the DBSCAN algorithm in the 3D space for clusters finding. It works, now I wonder how is the quality of the code. I'm especially concerned about incrementing the size of ...