Clustering Techniques for Categorical Data: Correspondence Analysis

Stelios Zimeras, Manolis Kalligeris
DOI: 10.4018/978-1-7998-5442-5.ch004

Abstract

Categorical data are commonly summarized in contingency tables, which arise whenever categorical variables are cross-classified. Correspondence analysis is a statistical visualization method for picturing the associations between the levels of a two-way contingency table. This chapter examines correspondence analysis for categorical data using statistical and mathematical techniques, with attention to settings where large volumes of data must be organized and analyzed, and presents a MATLAB package for simple correspondence analysis.

Background

Categorical data analysis is a methodology in which data, especially big data, are summarized in contingency tables. This summary makes the analysis more effective and the final results easier to organize. A contingency table records frequencies in an I×J table (matrix), where the cell frequency nij is the number of individuals that fall in row category i and column category j.
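
As an illustration of the cross-classification described above, the following minimal Python sketch builds an I×J table of cell frequencies from two categorical variables. The function name contingency_table and the smoker/exercise data are hypothetical and chosen only for the example; they do not come from the chapter or its MATLAB package.

```python
from collections import Counter

def contingency_table(rows, cols):
    """Cross-classify two categorical variables into an I x J table of counts."""
    if len(rows) != len(cols):
        raise ValueError("both variables must have one value per individual")
    row_levels = sorted(set(rows))      # the I row categories
    col_levels = sorted(set(cols))      # the J column categories
    counts = Counter(zip(rows, cols))   # n_ij keyed by (row level, column level)
    # Counter returns 0 for category pairs that never occur, so empty cells are kept.
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table

# Hypothetical data: one (smoker, exercise) pair per individual.
smoker = ["yes", "no", "no", "yes", "no", "no"]
exercise = ["low", "high", "low", "low", "high", "high"]

row_levels, col_levels, table = contingency_table(smoker, exercise)
for level, counts_in_row in zip(row_levels, table):
    print(level, counts_in_row)
```

The cell frequencies sum to the number of individuals, which is a quick check that no observation was dropped during the cross-classification.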

When the data are a sample, these frequencies are modeled as random variables that follow particular probability distributions. The most important distributions for categorical data analysis are:

Poisson Distribution (Pois(λi))

The Poisson distribution is a discrete probability distribution for the number of events that occur randomly in a given interval of time (or space). The Poisson model assumes that the cell counts Xi = ni (ni ≅ nij) of a contingency table are independent Poisson random variables,

Xi ~ Poisson(λi), λi > 0, i = 1, …, N = I×J,

where λi is the mean number of events in cell i. We say that X follows a Poisson distribution with parameter λ. The probability mass function, that is, the probability of observing ni individuals in a given cell i (= (i, j)), is given by

\[ P(X_i = n_i) = \frac{e^{-\lambda_i}\,\lambda_i^{n_i}}{n_i!}, \quad n_i = 0, 1, 2, \ldots, \quad i = 1, \ldots, N, \]

with E(Xi) = λi and Var(Xi) = λi (Figure 1).

Figure 1. Graphical presentation of the Poisson distribution for different parameters N and λ: (a) N = 15, λ = 5; (b) N = 80, λ = 3; (c) N = 200, λ = 50.
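
The Poisson probabilities described above can be checked numerically. The sketch below is a minimal illustration, not the chapter's MATLAB code: the function name poisson_pmf is hypothetical, λ = 5 is taken from panel (a) of Figure 1, and the infinite support is truncated at 100, which is an assumption that is harmless here because the remaining tail mass is negligible.

```python
import math

def poisson_pmf(n, lam):
    """P(X = n) = exp(-lam) * lam**n / n! for X ~ Poisson(lam)."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if n < 0:
        return 0.0
    # Work on the log scale so that large n does not overflow the factorial.
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))

lam = 5.0
support = range(0, 100)  # truncated support (assumption: tail beyond 99 is negligible)
probs = [poisson_pmf(n, lam) for n in support]

mean = sum(n * p for n, p in zip(support, probs))
var = sum((n - mean) ** 2 * p for n, p in zip(support, probs))
print(f"total probability = {sum(probs):.6f}, mean = {mean:.6f}, variance = {var:.6f}")
```

Both the mean and the variance come out equal to λ up to rounding, which is the defining property of the Poisson model for cell counts.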

Binomial Distribution (Bin(n,p))

In this case, each trial has two possible outcomes: success (with probability p) and failure (with probability q = 1 - p). If Xi is the number of successes in n trials and p is the probability of success, then Xi ~ Binomial(n, p), provided the following conditions hold:

  • 1: The number of observations n is fixed.

  • 2: Each observation is independent.

  • 3: Each observation represents one of two outcomes (“success” or “failure”).

  • 4: The probability of “success” p is the same for each observation.

The probability mass function is given by

\[ P(X_i = x_i) = \binom{n}{x_i}\, p^{x_i} (1 - p)^{n - x_i}, \quad x_i = 0, 1, \ldots, n, \]

with

\[ \binom{n}{x_i} = \frac{n!}{x_i!\,(n - x_i)!}, \]

E(Xi) = n∙p and Var(Xi) = n∙p∙(1 - p) (Figure 2).

Figure 2. Graphical presentation of the binomial distribution for different parameters n and p: (a) n = 20, p = 0.25; (b) n = 80, p = 0.5; (c) n = 200, p = 0.8.
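
The binomial probabilities can be evaluated in the same way. The following minimal Python sketch uses the parameters of panel (a) of Figure 2, n = 20 and p = 0.25; the function name binomial_pmf is hypothetical and is not part of the chapter's MATLAB package.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p**x * (1 - p)**(n - x) for X ~ Binomial(n, p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if x < 0 or x > n:
        return 0.0
    # math.comb (Python 3.8+) returns the exact binomial coefficient n! / (x! (n - x)!).
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)

n, p = 20, 0.25
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

mean = sum(x * q for x, q in enumerate(probs))
var = sum((x - mean) ** 2 * q for x, q in enumerate(probs))
print(f"total probability = {sum(probs):.6f}, mean = {mean:.6f}, variance = {var:.6f}")
```

For these parameters the computed mean and variance agree with n∙p = 5 and n∙p∙(1 - p) = 3.75 up to rounding.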
