Commit 39b0796

Initial commit for GitHub
0 parents  commit 39b0796

4 files changed: 254 additions, 0 deletions

.gitignore

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
*~
.~*#
.nfs*
*.mat
*.fig
*.aux
*.log
*.blg
*.out
*.pdf
*.gz
*.ods
*.eps

README.md

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
# User Manual

### Summary

A MATLAB/Octave script that generates 2D data for clustering; data is
created along straight lines, which can be more or less parallel
depending on the selected input parameters.

### Synopsis

    [data, clustPoints, idx, centers, slopes, lengths] =
        generateData(slope, slopeStd, numClusts, xClustAvgSep,
                     yClustAvgSep, lengthAvg, lengthStd, lateralStd,
                     totalPoints)

### Input parameters

Parameter      | Description
-------------- | ------------------------------------------------------------------------------------------------------
*slope*        | Base direction of the lines on which clusters are based
*slopeStd*     | Standard deviation of the slope; used to obtain a random slope variation from the normal distribution, which is added to the base slope in order to obtain the final slope of each cluster
*numClusts*    | Number of clusters (and therefore of lines) to generate
*xClustAvgSep* | Average separation of line centers along the X axis
*yClustAvgSep* | Average separation of line centers along the Y axis
*lengthAvg*    | The base length of lines on which clusters are based
*lengthStd*    | Standard deviation of line length; used to obtain a random length variation from the normal distribution, which is added to the base length in order to obtain the final length of each line
*lateralStd*   | "Cluster fatness", i.e., the standard deviation of the distance from each point to the respective line, in both *x* and *y* directions; this distance is obtained from the normal distribution
*totalPoints*  | Total points in generated data (will be randomly divided among clusters)
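
The random division of *totalPoints* among clusters can be sketched as follows. This is a minimal, self-contained sketch that mirrors the approach in `generateData.m` (normalize absolute normal draws, round, then repair the total and any empty clusters); the values of `numClusts` and `totalPoints` and the helper names `w`, `c`, `k` and `m` are illustrative assumptions.

```matlab
numClusts = 5;       % illustrative value
totalPoints = 200;   % illustrative value

% Draw positive random weights and turn them into rounded point counts
w = abs(randn(numClusts, 1));
clustPoints = round(w / sum(w) * totalPoints);

% Rounding can leave the total off by a few points; repair it
while sum(clustPoints) < totalPoints
    [c, k] = min(clustPoints);
    clustPoints(k) = c + 1;
end
while sum(clustPoints) > totalPoints
    [c, k] = max(clustPoints);
    clustPoints(k) = c - 1;
end

% Move one point from the largest cluster into each empty cluster
emptyClusts = find(clustPoints == 0);
for n = 1:numel(emptyClusts)
    [c, m] = max(clustPoints);
    clustPoints(m) = c - 1;
    clustPoints(emptyClusts(n)) = 1;
end
```

After this runs, `clustPoints` sums to `totalPoints` and every cluster holds at least one point, provided `totalPoints >= numClusts`.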

### Return values

Value         | Description
------------- | --------------------------------------------------------------------------------------
*data*        | Matrix (*totalPoints* x *2*) with the generated data
*clustPoints* | Vector (*numClusts* x *1*) containing number of points in each cluster
*idx*         | Vector (*totalPoints* x *1*) containing the cluster indices of each point
*centers*     | Matrix (*numClusts* x *2*) containing centers from where clusters were generated
*slopes*      | Vector (*numClusts* x *1*) containing the effective slopes used to generate clusters
*lengths*     | Vector (*numClusts* x *1*) containing the effective lengths used to generate clusters
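
The relationship between *clustPoints* and *idx* can be illustrated with a small sketch. In `generateData.m` the points of each cluster are stored in consecutive rows, so *idx* is a run of 1s, then 2s, and so on. The cluster sizes below and the names `offset` and `recovered` are hypothetical, chosen only for illustration.

```matlab
clustPoints = [3; 1; 2];   % hypothetical cluster sizes

% Build idx the same way the script does: one contiguous run per cluster
idx = zeros(sum(clustPoints), 1);
offset = 0;
for i = 1:numel(clustPoints)
    idx(offset + 1 : offset + clustPoints(i)) = i;
    offset = offset + clustPoints(i);
end

% Counting the labels recovers the cluster sizes
recovered = accumarray(idx, 1);
```

With the real outputs, the rows of *data* belonging to cluster `k` would be selected as `data(idx == k, :)`.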

### Usage example

    [data, cp, idx] = generateData(1, 0.5, 5, 15, 15, 5, 1, 2, 200);

The previous command creates 5 clusters with a total of 200 points, with
a base slope of 1 (*std*=0.5), separated on average by 15 units in both
*x* and *y* directions, with an average length of 5 units (*std*=1) and a
"fatness" or spread of 2 units.

To take a quick look at the clusters, run:

    scatter(data(:,1), data(:,2), 8, idx);
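
To see how slope, length and "fatness" shape a single cluster, the following sketch draws the points of one cluster. It follows the same steps as the inner loop of `generateData.m` (uniform position along the line, projection onto *x* and *y*, independent normal noise), but in vectorized form; the parameter values and the names `center`, `len`, `n`, `theta`, `proj` and `pts` are assumptions made for illustration.

```matlab
center = [0 0];     % hypothetical cluster center
slope = 1;          % slope of this cluster's line
len = 5;            % length of this cluster's line
lateralStd = 2;     % spread around the line
n = 1000;           % number of points to draw (illustrative)

% Uniform positions along the line, measured from its center
position = len * rand(n, 1) - len / 2;

% Project each position onto x and y using the line angle;
% sin(theta) * position equals slope * (cos(theta) * position)
theta = atan(slope);
proj = [cos(theta) * position, sin(theta) * position];

% Add independent normal noise in x and y
pts = repmat(center, n, 1) + proj + lateralStd * randn(n, 2);
```

Because the noise is added separately in *x* and *y*, the spread around the line is isotropic rather than strictly perpendicular to it.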

### Reference

If you use this script in your work, please cite the following reference:

- Fachada, N., Figueiredo, M.A.T., Lopes, V.V., Martins, R.C., Rosa,
A.C., [Spectrometric differentiation of yeast strains using minimum volume
increase and minimum direction change clustering criteria](http://www.sciencedirect.com/science/article/pii/S0167865514000889),
Pattern Recognition Letters, vol. 45, pp. 55-61 (2014), doi: http://dx.doi.org/10.1016/j.patrec.2014.03.008

### License

This script is made available under the [Simplified BSD License](license.txt).

generateData.m

Lines changed: 151 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,151 @@
function [data, clustPoints, idx, centers, slopes, lengths] = ...
    generateData( ...
        slope, ...
        slopeStd, ...
        numClusts, ...
        xClustAvgSep, ...
        yClustAvgSep, ...
        lengthAvg, ...
        lengthStd, ...
        lateralStd, ...
        totalPoints ...
    )
% GENERATEDATA Generates 2D data for clustering; data is created along
%              straight lines, which can be more or less parallel depending
%              on the slopeStd argument.
%
% [data clustPoints idx centers slopes lengths] =
%    GENERATEDATA(slope, slopeStd, numClusts, xClustAvgSep, yClustAvgSep, ...
%                 lengthAvg, lengthStd, lateralStd, totalPoints)
%
% Inputs:
%   slope        - Base direction of the lines on which clusters are based.
%   slopeStd     - Standard deviation of the slope; used to obtain a random
%                  slope variation from the normal distribution, which is
%                  added to the base slope in order to obtain the final slope
%                  of each cluster.
%   numClusts    - Number of clusters (and therefore of lines) to generate.
%   xClustAvgSep - Average separation of line centers along the X axis.
%   yClustAvgSep - Average separation of line centers along the Y axis.
%   lengthAvg    - The base length of lines on which clusters are based.
%   lengthStd    - Standard deviation of line length; used to obtain a random
%                  length variation from the normal distribution, which is
%                  added to the base length in order to obtain the final
%                  length of each line.
%   lateralStd   - "Cluster fatness", i.e., the standard deviation of the
%                  distance from each point to the respective line, in both x
%                  and y directions; this distance is obtained from the
%                  normal distribution.
%   totalPoints  - Total points in generated data (will be
%                  randomly divided among clusters).
%
% Outputs:
%   data         - Matrix (totalPoints x 2) with the generated data
%   clustPoints  - Vector (numClusts x 1) containing number of points in each
%                  cluster
%   idx          - Vector (totalPoints x 1) containing the cluster indices of
%                  each point
%   centers      - Matrix (numClusts x 2) containing centers from where
%                  clusters were generated
%   slopes       - Vector (numClusts x 1) containing the effective slopes
%                  used to generate clusters
%   lengths      - Vector (numClusts x 1) containing the effective lengths
%                  used to generate clusters
%
% ----------------------------------------------------------
% Usage example:
%
%   [data cp idx] = GENERATEDATA(1, 0.5, 5, 15, 15, 5, 1, 2, 200);
%
% This creates 5 clusters with a total of 200 points, with a base slope
% of 1 (std=0.5), separated on average by 15 units in both x and y
% directions, with an average length of 5 units (std=1) and a "fatness" or
% spread of 2 units.
%
% To take a quick look at the clusters just do:
%
%   scatter(data(:,1), data(:,2), 8, idx);

% N. Fachada
% Instituto Superior Técnico, Lisboa, Portugal

% Make sure totalPoints >= numClusts
if totalPoints < numClusts
    error('Number of points must be equal to or larger than the number of clusters.');
end;

% Determine number of points in each cluster
clustPoints = abs(randn(numClusts, 1));
clustPoints = clustPoints / sum(clustPoints);
clustPoints = round(clustPoints * totalPoints);

% Make sure totalPoints is respected
while sum(clustPoints) < totalPoints
    % If a point is missing, add it to the smallest cluster
    [C,I] = min(clustPoints);
    clustPoints(I(1)) = C + 1;
end;
while sum(clustPoints) > totalPoints
    % If there is an extra point, remove it from the largest cluster
    [C,I] = max(clustPoints);
    clustPoints(I(1)) = C - 1;
end;

% Make sure there are no empty clusters
emptyClusts = find(clustPoints == 0);
if ~isempty(emptyClusts)
    % If there are empty clusters...
    numEmptyClusts = size(emptyClusts, 1);
    for i=1:numEmptyClusts
        % ...get a point from the largest cluster and assign it to the
        % empty cluster
        [C,I] = max(clustPoints);
        clustPoints(I(1)) = C - 1;
        clustPoints(emptyClusts(i)) = 1;
    end;
end;

% Initialize data matrix
data = zeros(sum(clustPoints), 2);

% Initialize idx (vector containing the cluster indices of each point)
idx = zeros(totalPoints, 1);

% Initialize lengths vector
lengths = zeros(numClusts, 1);

% Determine cluster centers
xCenters = xClustAvgSep * numClusts * (rand(numClusts, 1) - 0.5);
yCenters = yClustAvgSep * numClusts * (rand(numClusts, 1) - 0.5);
centers = [xCenters yCenters];

% Determine cluster slopes
slopes = slope + slopeStd * randn(numClusts, 1);

% Create clusters
for i=1:numClusts
    % Determine length of line where this cluster will be based
    lengths(i) = abs(lengthAvg + lengthStd*randn);
    % Determine how many points have been assigned to previous clusters
    sumClustPoints = 0;
    if i > 1
        sumClustPoints = sum(clustPoints(1:(i - 1)));
    end;
    % Create points for this cluster
    for j=1:clustPoints(i)
        % Determine where in the line the next point will be projected
        position = lengths(i) * rand - lengths(i) / 2;
        % Determine x coordinate of point projection
        delta_x = cos(atan(slopes(i))) * position;
        % Determine y coordinate of point projection
        delta_y = delta_x * slopes(i);
        % Get point distance from line in x coordinate
        delta_x = delta_x + lateralStd * randn;
        % Get point distance from line in y coordinate
        delta_y = delta_y + lateralStd * randn;
        % Determine the actual point
        data(sumClustPoints + j, :) = [(xCenters(i) + delta_x) (yCenters(i) + delta_y)];
    end;
    % Update idx
    idx(sumClustPoints + 1 : sumClustPoints + clustPoints(i)) = i;
end;

license.txt

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
Copyright (c) 2012, Nuno Fachada
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

* Redistributions of source code must retain the above copyright
  notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in
  the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
