Jaccard index and Jaccard distance

The Jaccard index and the Jaccard distance are two metrics used in statistics to compare the similarity and diversity of samples. They are named after the Swiss botanist Paul Jaccard.

Formal description

The Jaccard index (or Jaccard coefficient) is the ratio of the cardinality (size) of the intersection of the sets considered to the cardinality of their union. It measures the similarity between the sets. Given two sets A and B, the index is:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}.

The extension to n sets is straightforward:

J(S_1, S_2, \dots, S_n) = \frac{|S_1 \cap S_2 \cap \dots \cap S_n|}{|S_1 \cup S_2 \cup \dots \cup S_n|}.
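The two formulas above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name jaccard_index is hypothetical, and returning 1.0 when every set is empty is an assumption made here, since the ratio is undefined when the union is empty.

```python
def jaccard_index(*sets):
    """Return |S_1 ∩ ... ∩ S_n| / |S_1 ∪ ... ∪ S_n| for one or more sets."""
    if not sets:
        raise ValueError("at least one set is required")
    sets = [set(s) for s in sets]
    union = set().union(*sets)
    if not union:
        # Assumption: all sets are empty, so we treat them as identical.
        return 1.0
    intersection = set.intersection(*sets)
    return len(intersection) / len(union)


# Two sets: 2 shared elements out of 4 distinct elements.
print(jaccard_index({1, 2, 3}, {2, 3, 4}))
# Three sets: 1 shared element out of 5 distinct elements.
print(jaccard_index({1, 2, 3}, {2, 3, 4}, {3, 5}))
```

The same function covers the two-set case and the n-set generalization, because both are the ratio of one intersection size to one union size.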

The Jaccard distance measures the dissimilarity between sets. It is obtained simply by subtracting the Jaccard index from 1:

J_{\delta}(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}.

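As with the index, the generalization to n sets is:

J_{\delta}(S_1, S_2, \dots, S_n) = 1 - J(S_1, S_2, \dots, S_n) = \frac{|S_1 \cup S_2 \cup \dots \cup S_n| - |S_1 \cap S_2 \cap \dots \cap S_n|}{|S_1 \cup S_2 \cup \dots \cup S_n|}.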
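As a rough illustration, the distance can be computed directly from the union and intersection sizes, without going through the index first. The name jaccard_distance is hypothetical, and returning 0.0 when all sets are empty is an assumption (the formula itself is undefined in that case).

```python
def jaccard_distance(*sets):
    """Return (|union| - |intersection|) / |union| for one or more sets."""
    if not sets:
        raise ValueError("at least one set is required")
    sets = [set(s) for s in sets]
    union = set().union(*sets)
    if not union:
        # Assumption: empty sets are treated as identical, so distance 0.
        return 0.0
    intersection = set.intersection(*sets)
    return (len(union) - len(intersection)) / len(union)


print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 2 unshared out of 4 distinct
print(jaccard_distance({1, 2}, {1, 2}))        # identical sets
print(jaccard_distance({1}, {2}))              # disjoint sets
```

Identical sets give a distance of 0 and disjoint sets give a distance of 1, which matches the reading of the distance as one minus the index.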

Similarity between binary sets

The Jaccard index is useful for studying the similarity between objects described by binary attributes.

Consider two sequences A and B, each with n binary attributes. Each attribute is either 0 or 1. We write:

A = (a_1, a_2, \dots, a_n)

B = (b_1, b_2, \dots, b_n)

Several quantities are defined to characterize the two sequences:

M_{11} is the number of attributes equal to 1 in both A and B
M_{01} is the number of attributes equal to 0 in A and 1 in B
M_{10} is the number of attributes equal to 1 in A and 0 in B
M_{00} is the number of attributes equal to 0 in both A and B

Each pair of attributes necessarily belongs to exactly one of the four categories, so that:

M_{11} + M_{01} + M_{10} + M_{00} = n.

The Jaccard index becomes:

J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}

The Jaccard distance becomes:

J_{\delta} = \frac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}}
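The four counts and the resulting index can be sketched as follows. The function names binary_counts and binary_jaccard are hypothetical, and raising an error when no attribute equals 1 in either sequence is an assumption, since the denominator is then zero.

```python
def binary_counts(a, b):
    """Count attribute pairs and return (M11, M01, M10, M00)."""
    if len(a) != len(b):
        raise ValueError("both sequences must have the same length n")
    m11 = m01 = m10 = m00 = 0
    for x, y in zip(a, b):
        if x == 1 and y == 1:
            m11 += 1
        elif x == 0 and y == 1:
            m01 += 1
        elif x == 1 and y == 0:
            m10 += 1
        elif x == 0 and y == 0:
            m00 += 1
        else:
            raise ValueError("attributes must be 0 or 1")
    return m11, m01, m10, m00


def binary_jaccard(a, b):
    """Return M11 / (M01 + M10 + M11); M00 does not enter the formula."""
    m11, m01, m10, _ = binary_counts(a, b)
    denominator = m01 + m10 + m11
    if denominator == 0:
        # Assumption: treat the all-zero case as undefined.
        raise ValueError("index undefined: no attribute equals 1")
    return m11 / denominator


a = (1, 1, 0, 0)
b = (1, 0, 1, 0)
print(binary_counts(a, b))   # one pair in each of the four categories
print(binary_jaccard(a, b))  # 1 / (1 + 1 + 1)
```

Note that attributes equal to 0 in both sequences are counted in M_{00} but have no effect on the index or the distance.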

Example

A = (1, 0, 1, 0, 0, 0, 0)
B = (1, 0, 0, 1, 0, 1, 1)

M_{11} = 1
M_{00} = 2
M_{01} = 3
M_{10} = 1

J = \frac{1}{3 + 1 + 1} = 0.2

J_{\delta} = \frac{3 + 1}{3 + 1 + 1} = 0.8 = 1 - J
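One way to check this example is to view each binary sequence as the set of positions where the attribute equals 1, and then apply the set formulas. The sketch below assumes 0-based positions; the variable names are illustrative only.

```python
A = (1, 0, 1, 0, 0, 0, 0)
B = (1, 0, 0, 1, 0, 1, 1)

# Positions (0-based) where each sequence has an attribute equal to 1.
set_a = {i for i, value in enumerate(A) if value == 1}
set_b = {i for i, value in enumerate(B) if value == 1}

union = set_a | set_b
index = len(set_a & set_b) / len(union)     # M11 / (M01 + M10 + M11)
distance = len(set_a ^ set_b) / len(union)  # (M01 + M10) / (M01 + M10 + M11)

print(sorted(set_a), sorted(set_b))  # [0, 2] [0, 3, 5, 6]
print(index, distance)               # 0.2 0.8
```

The intersection corresponds to M_{11}, the symmetric difference to M_{01} + M_{10}, and positions that appear in neither set correspond to M_{00}.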

See also

References

  • Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining (2005), ISBN 0-321-32136-7
  • Paul Jaccard (1901), Bulletin de la Société Vaudoise des Sciences Naturelles 37, 241-272.
  • Tanimoto, T.T. (1957), IBM Internal Report, 17th Nov. 1957.

External links

  • Jaccard index and diversity between species
  • Example of the Jaccard coefficient
  • Introduction to data mining
  • SimMetrics, an implementation of similarity metrics
