Big Data Clustering Techniques: Literature Review

May 4, 2024

Writing Thesis

5 min read

2. CRITICAL SURVEY OF LITERATURE

This chapter gives an overview of the Big Data clustering techniques and the distance and the similarity measures used by them. Clustering using power method and its convergence techniques are described. The need of Hadoop to handle Big Data is been explained.

2.1 MACHINE LEARNING

Machine learning is the method of analyzing data that uses algorithm that learns continuously from the data without being programmed explicitly. They learn themselves when they are exposed to new data. Machine learning is similar to that of data mining to search for the patterns in the data [20]. Machine learning uses these data to detect patterns and adjust program actions accordingly. Facebook, web search engine, smart phones, page ranking etc. uses machine learning algorithms for its processing. It constructs program with the characteristics that automatically adjusts itself to increase the performance using the existing data. The motivation of machine learning is to learn automatically and identify the patterns to take decisions based on them. Machine learning is divided three methods namely [42],

Supervised learning or Predictive learning
Unsupervised learning or Descriptive learning
Semi supervised learning.

2.1.1 Supervised Learning

Supervised learning also called as the predictive learning is a method of mapping the inputs and the outputs, given a set of training set

∑j=1Paijand

ajtotal number ofspecies j

∑i=1Paij. Then, the Chi-square distance is given as

Xih2=∑j=1p1aijahjah-aijai2

Chebyshev Distance (Maximum valuedistance) – It is the defined as the vector space where the distance between any two vectors is greatest of the distance along any coordinate dimension. It is call as max metric or L metric [50,52].

1	1	1
1	1	1
1	1	1

It is the maximum distance between the points in a single dimension. The formula computes the distance between two points

X=x1x2∙∙∙∙∙xnand

Y=y1y2∙∙∙∙∙ynis

Dchebp,q=maxi=xj-yj

Where

xiand

yjare the values of the

jthvariable at points X and Y. This is equal to the limit

limk→∞⁡∑1n(pi-qik)1k

Bit Vector Distance – Let there be a N*N matrix then the bit vector distance is found at each point having d dimensions using M_b(P_i,P_j) as a d bit vector. If the x^th dimension of point is greater than the numerical value the bit X of M_b (P_j,P_i) is set to 0.

Mahalanobis distance – The Mahalanobis distance is given based on the correlation structure of the data points and the individual points [53,54]. The Mahalanobis distance is computed as

Mi=(yi-y̅)S-1(yi-y̅)

where

yi= data of the i^th row

Y – Row of the means

S – the estimated covariance matrix

Let X be an N*p matrix. Then the i^th row of X is

xiT=(xi1……xip). The Mahalanobis distance is given as

dMH(i,j)=(xi-xj)T Σ-1xi-xj

Here Σ is the

p×psample covariance matrix. To perform it on quadratic multiplication, the mean difference is obtained and multiplied by the covariance after obtaining its transpose.

T²Distance Measure – It is the square of Mahalanobis distance. So

Ti2=Mj2. The T²Distance measure is given by the formula[52,54]

UCLT2=(n-1)2n1-β:p2:n-p-12

Where n is the number of observations and p is the number of variables.

2.4 SIMILARITY MEASURES

It is a metric used to measure the similarity between two objects. It compares the similarity and diversity of sample sets. It is defined as the size of the intersection divided by the size of sample sets. Given two sequence of measurements

X=x1,x2,∙∙∙∙∙,xnand

Y=y1,y2,∙∙∙∙∙,yn, the dependencies between X and Y give the similarity between them. The similarity measure can be defined as the function that becomes the degree between a pair of text objects. The value of similarity lies between 0 and 1. If the value is close to 0 then, the instance or data objects are said to be similar and if it is one they are dissimilar. A metric is said to be similar if it satisfies the following properties [55,56].

Limited range – The value of the data points is less than the largest similarity measure
S(X,Y)≤S0
Reflexivity
SX,Y=S0 if and only if X=Y

Symmetric
SX,Y=SY,X
Triangle Inequality –
SX,YSY,Z≤ZX,Y+SY,ZS(X,Z)

The metric is said to be asymmetric if it satisfies the following properties.

Non Negative
D(X,Y)≥0
Reflexive
DX,Y=0if and only if

X=Y
Symmetric
DX,Y=D(Y,X)
Triangle Inequality
DX,Y+D(Y,Z)≥D(X,Z)

In spectral clustering similarity measure is used to convert data to overcome the issues related to lack of convexity in the form of the data distribution. There are various similarity measures and some of them are explained below.

Jaccard Similarity – It measures the similarity between the objects of binary values. It is specially used for handling asymmetric binary attributes. It is usually done by choosing the number of attributes that both objects have in common and divide it by the total number of objects. The jaccard similarity is given by

JA,B=A∩BA∪B = A∩BA+B-A∩B

where {A,B} is the given sets.

Arepresents the number of elements.

A∩Brepresents the number of elements on both sets

A∪Brepresents the number of elements on either sets

Cosine Similarity – It is used to measure the similarity between two vectors by measuring the cosine of angle between them. When the angle is 0, the cosine function is equal to 1 and it’s less than one for other values. If the angle between the vectors shortens then there is more similarity between the vectors [53,57]. The formula for cosine similarity is given as

Sx,y=cos⁡θ=x.yx*y

Dice Similarity – It is the similarity measure related to Jaccard index. Given two datasets X and Y, then the coefficient is given as

S=2|x∩y|x+|y|

The coefficients are calculated for two strings x and y as

S=2ntnx+ny

where

2ntis the number of character bigrams in X and Y,

Your thesis deserves more than a rushed final draft.

A compelling thesis requires a defensible argument, rigorous evidence, and polished academic writing. Our thesis specialists work with master's and doctoral students to produce well-structured model documents that meet committee and institutional expectations.

✓ Plagiarism-free · ✓ 100% human · ✓ Free revisions · ✓ Confidential

Get Thesis Writing Help

🔒 No payment to start · From 3 hrs

nxis the number of bigrams in X,

nyis the number of bigrams in Y.

Hamming Distance – It is the distance between the strings of equal length in which the number of position of the corresponding symbols are different. The hamming distance between the 2 variables (x,y) is given as dH(x,y) to the number of places where x and y are different , it can be given as the number of bits or characters that has to be altered to change one string to another. Hence the Hamming distance between two vectors

x,y∈F(n)is defined by the number of coefficients in which they differ. It measures the minimum number of substitutions needed to change one string to another and the minimum error that occurs during the transformation. Here D should satisfy the condition.

d(x, y) = 0, iff x,y agrees to all coordinates.
The number of coordinates in which x differ from y.
The number of coordinates in which y differ from x.
d(x, y) is equal to the minimum number of changes in coordinates to get from x to y.

2.5 POWER METHOD

Power method is a simple iterative method to find the dominant eigen value and eigen vector for a matrix A where λ is the largest of eigen value and

v0is the largest of eigen vector. The power method is found using[58]

vt+1=cWvt

Where

vt is the vector at iteration t.

W is the Square matrix

c is the normalizing constant to keep

vtfrom getting too large or too small.

Since the power method for approximating the eigen value is iterative the initial approximation x₀of thedominant eigen vector is chosen. The sequence of eigen vector is given as [59,60]

x1=Ax̅0

x2=Ax̅1

= A (

Ax̅0) =

A2x̅0

x3=Ax̅2

A(A2x̅0)=

A3x̅0

xk=Ax̅k

A(Akx̅0)=

Akx̅0

A good approximation is obtained for larger values of k. To analyze the convergence of the power method a detailed formulation is shown. Given an N*N matrix A, the eigen vector corresponding to the eigen value is identified. Let λ₀, λ₁, λ₂….. λ_n-1 be the eigen values of A. The largest eigen value is λ₀ i.e. |λ₀|>|λ₁|>|λ₂|……………|λ_n|. Let there be an eigen pair (λ_j,

vj) and assume that the vectors

v0,

v1, v2….

vn-1be linearly independent. To start with we take an initial vector at a time 0. The vector is written as a combination of the vector that is linearly independent. Here x⁽⁰⁾is given as a linear combination of eigen vectors. Hence,

x⁽⁰⁾ = γ₀

v₀+ γ₁

v₁ + γ₂

v₂ …………. γ_n-1

v_n-1.

If the first vector is written with a matrix A we create

x⁽¹⁾=Ax⁽⁰⁾

= A(γ₀

v₀+ γ₁

v₁+ γ₂

v₂ …. γ_n-1

v_n-1)

= γ₀A

v₀ + γ₁A

v₁+ γ₁A

v₁…… + γ_n-1A

v_n

Similarly

x(2)=Ax(1)

= A(

γ0Av₀+

γ1Av₁+

γ2 Av₂ ….

γn-1A v_n-1)

But A

v₀is λ₀

Your thesis deserves more than a rushed final draft.

✓ Plagiarism-free · ✓ 100% human · ✓ Free revisions · ✓ Confidential

Get Thesis Writing Help

🔒 No payment to start · From 3 hrs

v₀and A

v1is

λ1v1since the eigen vectors are chosen[59].

A(γ0λ0v0+γ1λ1v1+…..γn-1λn-1vn-1)

=γ0Aλ0v0+γ1Aλ1v1+…..γn-1Aλn-1vn-1)

Here,

Aλ0v0=λ02v0,

Aλ1v1=λ12v1.. . . . …….

Aλn-1xn-1=λn-12vn-1

On repeating for higher values we get,

xk+1=Axk

γ0λ0kv0+γ1λ1kv1+……γn-1λn-1kvn-1

Here

Aλ0kv0=λ0k+1v0,

Aλ2kv2=λ2k+1v2…….

Aλn-1kvn-1=λn-1k+1vn-1

xk+1=γ0λ0k+1v0+γ1λ1k+1v1+……γn-1λn-1k+1vn-1

So the general equation becomes

limk→∞⁡xk=

limk→∞(

γ0λ0kv0+

γ1λ1kv1+

γ2λ2kv2……….

γn-1λn-1kvn-1)

Here

γ0λ0kv0is largest in magnitude and it dominates other vectors. The direction of

v₁ is insignificant and vector

v₀ keeps on increasing. To fix the increasing value

v₀ the vectors should be kept under control. So it is rewritten as [60]

x1=1λ0Ax0

= 1λ0A(γ0v0+γ1v1……..γn-1vn-1)

=γ0 1λ0 Av0

+γ01λ0Av0………….+γn-11λ0Avn-1

Here

Av0=λ₀

v₀on substitution we get,

=γ0 1λ0 (λ0v0)

γ0v0

Similarly,

Av1= λ1v1=

γ1v1.Hence the first term becomes

γ0v0.

x1=γ0v0+λ1λ0v1+…+λn-1λ0vn-1

x2=1λ0Ax1

x2=1λ0A(γ0v0+γ1λ1λ0v1+……. γn-1λn-1λ0vn-1

γ0v0+λ1λ02v1……………

λn-1λ02vn-1

In this case the first term remains the same for any value of x and the next term has the factor

λ1λ02and so on. Here

λ1λ0is the magnitude less than 1. Repeating for k steps we get

xk+1=γ0v0+γ1λ1λ0kv1

……………

γn-1λn-1λ0k+1vn-1

So we get,

limk→∞⁡xk=limk→∞⁡(γ0v0+γ1λ1λ0kv1…………⁡+γn-1λn-1λ0kvn-1)

Here the value

λ1λ0is less than 1 and when raised to the power of k they eventually go to 0. Hence, the 2^nd, 3^rd and the n^th term gradually goes to 0 and left with the first term. So, we get [59,60]

=γ0v0+γ10v1+……γn-1(0)vn-1

=γ0v0

Finally, the component in the direction of

Your thesis deserves more than a rushed final draft.

✓ Plagiarism-free · ✓ 100% human · ✓ Free revisions · ✓ Confidential

Get Thesis Writing Help

🔒 No payment to start · From 3 hrs

v0is constant and the component in direction of

v1shrinks and points to

v0. Due to normalization the vector avoids itself from growing itself too long or short. Eigen vector is finding the iteration involved and it’s about its length. Thus it is derived that the process produces a vector in the direction that is leading in magnitude. Hence, the power method is given as Ax = λx.

X_i+1=

Axi||Axi|| λ _i+1=

Axi

2.5.1 Convergence Analysis

The power method does not compute matrix decomposition. Hence, it can be used for larger sparse matrix. The power method converges slowly since it finds only one eigen vector. The starting vector x₀ in terms of eigenvector of A is given by

x ₀=

∑i=1naivi.

Then, x_k=Ax_k-1=A²x_{k-2 =}……A^kx₀

∑i=1nλikaivi= λnkanvn+∑i=1n-1λ1λnkaivi

The higher the power of

λ1λnit goes to 0.That is, it converges if

λ1is dominant and if q⁰ has a component in the direction of corresponding eigen vector x₁. Power method can also fail to converge in case if |

λ1|=|λ2|. The convergence rate of power iteration depends on the dominant ratio of the operator or the matrix which is challenging when the dominance ratio is close to one [59].

2.5.2 Drawbacks of the Power Method

If the initial column vector
x0is an eigenvector of A other than corresponding to

ƛ1the dominant eigen value, then it fails or converges to a wrong value [60,61].
The convergence of the power method depends on the magnitude of the dominant eigenvalue
ƛ1and the magnitude of next largest eigenvalue. If the ratio is small then convergence become slow or can even fail if

ƛ1=ƛ1̅
It gives only one pseudo eigenvalue and its difficult when it happens to be more than one dimension.
It is complicated in the case of many eigen values.
The power method needs a initial guess.

2. 6 CLUSTERING ALGORITHM USING POWER METHOD

Power Iterative Clustering (PIC) is an algorithm that clusters data using power method which is used to find the largest of eigen vector a combination of the eigen vectors in a linear manner. PIC provides a low dimensional embedding of datasets. It operates on a graph of nodes connected by weighted edges. The steps involved in PIC includes[6]

a. Similarity matrix calculation which is required in spectral clustering methods is replaced by matrix vector multiplication

b. Normalization to keep the vector constant and not making to converge or diverge and

c. Iterative matrix vector multiplication where the matrix W is combined with the vector v₀to obtain Wv₀.

2.6.1 Convergence Analysis

Power iteration converges to a matrix W but the results are not significant. To analyze the convergence of power iteration clustering algorithm, let us consider a matrix W with eigen values

v1,v2∙∙∙∙∙vnand eigen vector

ƛ1,ƛ2..…..ƛn. The PIC algorithm performs power iteration for less number of iteration, an stops at its convergence limit within the clusters before reaching its final convergence. The acceleration at t is given as

δt=vt-vt-1and the vector

∈t=δt-δt-1. The threshold value is

∈̂and the iteration is stopped when

∈tα≤ ∈̂. Hence the convergence can be said as power iteration clustering algorithm converges fast locally and it is slow globally[62]. The convergence rate of the algorithm towards on the dominant eigenvector

ƛ1depends on

ƛiƛ1tfor i=2…k i.e .

ƛiƛ1t≃1since the eigenvectors are close to one.

2.6.2 Disadvantages of Power Iteration Clustering

Power Iteration Clustering is fast, efficient and scalable. It has a better ability of handling larger datasets. Both requires data matrix and similarity matrix to be fitting into the processors memory that is infeasible for very larger datasets. Its challenge lies in calculation and storage of larger matrices [6,62]. Moreover PIC finds only one eigen vector which leads to inter collision problem. Lin and Cohen [6] discovered an interesting property of the largest eigenvector of W, before the elements of

vtconverge to the constant value, they first converge to local centers that corresponds the clusters in the data[63]. PIC converges quickly at local minima but it’s slow at global minima.

2.7 EXTRAPOLATION METHODS

The requirement for finding the approximate value of the sequence is significant for any algorithm to converge. An important problem that arises often is the computing limits of the vector

xn, where

xn∈CN, N is the dimension and its huge. Such sequence converges slowly or can even diverge. This condition occurs generally for iterative solutions of any linear systems. Hence, the computational complexity is high. In order to address these problems, convergence acceleration techniques like extrapolation methods has to be applied. An extrapolation method produces a new sequence

Ânfrom a given sequence

Anthat converges fast if it has a limit. If it does not have a limit the sequence can either converge or diverge slowly. Some of the most common extrapolation techniques include Richardson extrapolation, Fixed Point Iteration, Aitken Delta Square method and the Steffensen method [8].

2.7.1 Fixed Point Iteration

It is used to determine the roots of the given function. It computes the fixed point of the given iterated function [64,65]. Given a function f with real value an d points

x0in f, the fixed point iteration is given as

xn+1=f(xn)

where n = 0,1,2…..

which has the sequence

x0, x1,…

,xnconverges to the fixed point x. Let g be a continuous function in the interval (a, b) the fixed point is calculated by constantly evaluating g at points (a,b).

2.7.2 Aitkens Technique

It is a method to speed up the convergence of a sequence that is linearly convergent. It accelerates the convergence of any sequence [62]. The advantage of this method is that it improves the linear convergence of any iterative methods.

2.7.3 Steffensen Technique

The method of combining Aitken’s process with the fixed point iteration is called the Steffensen’s acceleration technique. It attains quadratic convergence. Steffensen’s method can be programmed for a generic function on certain constraint [66]. The main drawback of this method is that the value of

x0should be chosen correctly, else the method fails.

2.8 BIG DATA ANALYTICS

2.8.1 Data Deluge

It is a situation where the incoming data is more than the capacity to be managed. The rate of data is growing at a faster rate day by day. The data was about 165 Exabyte in the year 2007. It has reached to 800 Exabyte in the year 2009. There was an increase of upto 62% than the previous year. Moreover in the year 2009 the usage of the internet has been increased by 50% per year and hence the data rate doubles every 2 years[67]. In the year 2011, the term Big Data was introduced by McKinsey Global institute. The scale dataset increased from Gigabytes to Terabytes and even to Petabytes. According to the report from IDC (International Data Cooperation) in 2011, the data created was 1.8 ZB and it has increased 9 times in these 5 years. At present 2.5 quintillion bytes of data are created every day. Google process 100 PB of data, Twitter 12 PB. These huge quantity of data’s are being obtained from social networks, bank transactions, audio and videos etc. in you tube etc.

2.8.2 Big Data Storage Mechanism

Existing storage mechanisms of Big Data may be classified into three bottom-up levels namely file systems, databases, programming models [68].

File Systems: File systems are the foundation of the applications at upper levels and Google’s GFS is an expandable distributed file system to support large-scale, distributed, data-intensive applications.

Databases: Various database systems are developed to handle datasets at different scales and support various applications and MySQL technology for Big Data. The following three main MySQL database are key-value databases, column-oriented databases, and document-oriented databases.
Programming Models- The programming model is critical to implementing the application logics and facilitating the data analysis applications. However, it is difficult for traditional parallel models to implement parallel programs on a Big Data scale.

Big Data Storage –It is a storage method designed to store, manage and to process large amount of data. This data is stored such that it can be accessed easily and processed by applications to work on Big Data. The data storage subsystem in a Big Data platform organizes the collected information in a convenient format for analysis and value extraction. Existing massive storage technologies can be classified as

Direct Attached Storage (DAS)
Network storage

Direct Attached Storage (DAS): In DAS, numerous hard disks are connected directly with servers, and data management is held on the server rather than individual machines. The storage devices are peripheral equipment, each of which takes a certain amount of I/O resource and is managed by individual application software. It connects only limited size. The DAS storage is shown in Fig 2.1.

Your thesis deserves more than a rushed final draft.

✓ Plagiarism-free · ✓ 100% human · ✓ Free revisions · ✓ Confidential

Get Thesis Writing Help

🔒 No payment to start · From 3 hrs

Tags:

Writing Thesis

Need help with your assignment?

Expert writers available now. Original work, no AI, free revisions.

Get My Paper Done

🔒 No payment to start · Free revisions · Money-back guarantee

4.9 ★

Student rating

8,400+

Papers delivered

97%

On-time delivery

Our Services

Study Guides

Once Upon A Time English Literature EssayNovember 5, 2025
Role of Neuromedins in Pain and Signal Transduction PathwaysNovember 5, 2025
Tracking Dermal Drug DeliveryNovember 4, 2025
Analysis of Team Nursing Delivery ModelNovember 4, 2025
Probing the Competitive Antagonism Between O-GlcNAcylation and Phosphorylation Modifications of the Nucleosome at Histone H2B-Ser36November 4, 2025

Why students choose Scholaris

100% human writing, no AI
Plagiarism report with every order
Deadlines from 3 hours
Money-back guarantee
Free unlimited revisions

Related Study Guides

Ways Online Class Coursework Help Can Improve Your Grade

Online Class Coursework Help As technology continues to transform every aspect of life, online classes have become increasingly popular. The COVID-19 pandemic also accelerated the shift to online […]

April 6, 2023
Read →: Ways Online Class Coursework Help Can Improve Your Grade
Expert DNP Capstone Project Writers: Get High-Quality and Customized Nursing Papers Today

DNP Capstone Project Writers Do you need of assistance with your DNP Capstone project? Look no further, as we have a team of highly qualified DNP Capstone project […]

April 6, 2023
Read →: Expert DNP Capstone Project Writers: Get High-Quality and Customized Nursing Papers Today
Once Upon A Time English Literature Essay

I have been assigned an English poetry coursework. I was given the option of choosing one poem from the following two: ‘I am not that women’ and ‘Once […]

November 5, 2025
Read →: Once Upon A Time English Literature Essay
Role of Neuromedins in Pain and Signal Transduction Pathways

Neuromedin U and Its Role in Pain Signaling and Inflammation Abstract Neuromedins are a fascinating family of structurally conserved peptides initially isolated from the porcine spinal cord. Researchers […]

November 5, 2025
Read →: Role of Neuromedins in Pain and Signal Transduction Pathways
Tracking Dermal Drug Delivery

6. Tracking Dermal Drug Delivery Stimulated Raman scattering microscopy was used to gauge the permeation of topically applied pharmaceuticals and formulation solvents into pig dermis. This chemically selective […]

November 4, 2025
Read →: Tracking Dermal Drug Delivery

Big Data Clustering Techniques: Literature Review

Our Services

Study Guides

Related Study Guides

Ways Online Class Coursework Help Can Improve Your Grade

Expert DNP Capstone Project Writers: Get High-Quality and Customized Nursing Papers Today

Once Upon A Time English Literature Essay

Role of Neuromedins in Pain and Signal Transduction Pathways

Tracking Dermal Drug Delivery