Cosine Similarity (Overview)

Cosine similarity is a measure of similarity between two non-zero vectors. When we check similarity, we compare only two files, webpages, or articles with each other. Such a comparison does not show that your content is 100% plagiarism-free; it shows only whether the text matches one specific document or website.

Take a text document as an example. One algorithm for comparing documents is cosine similarity, a vector-based similarity measure. To use it, convert the documents into tf-idf vectors; cosine similarity then measures the orientation of the two vectors, not their magnitude. If we are working in two dimensions, this observation can be easily illustrated by drawing a circle of radius 1 and putting the end point of each vector on the circle, as in the picture below.

Another measure is the Jaccard similarity:

$J(doc_1, doc_2) = \frac{|doc_1 \cap doc_2|}{|doc_1 \cup doc_2|}$

For documents, we measure it as the proportion of the number of common words to the number of unique words in both documents.

In scikit-learn, sklearn.metrics.pairwise.cosine_similarity(X, Y=None, dense_output=True) computes cosine similarity between samples in X and Y. Cosine similarity, or the cosine kernel, computes similarity as the normalized dot product of X and Y:

$K(X, Y) = \frac{\langle X, Y \rangle}{\|X\| \, \|Y\|}$
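To make the two measures concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: it uses raw word counts instead of tf-idf weights to stay short, the tokenizer is a naive whitespace split, and the helper names (term_counts, cosine_sim, jaccard_sim) and the sample sentences are made up for this example.

```python
import math
from collections import Counter


def term_counts(text):
    """Bag of words from a naive lower-case whitespace split (an assumption)."""
    return Counter(text.lower().split())


def cosine_sim(a, b):
    """Normalized dot product of two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def jaccard_sim(a, b):
    """Number of shared distinct words divided by number of distinct words overall."""
    words_a, words_b = set(a), set(b)
    union = words_a | words_b
    if not union:
        raise ValueError("Jaccard similarity is undefined for two empty documents")
    return len(words_a & words_b) / len(union)


# Hypothetical sample documents.
doc_1 = term_counts("the cat sat on the mat")
doc_2 = term_counts("the cat lay on the rug")

print(round(cosine_sim(doc_1, doc_2), 3))   # 0.75
print(round(jaccard_sim(doc_1, doc_2), 3))  # 0.429
```

The two scores differ because cosine similarity takes word frequency into account (the repeated "the" weighs more), while the Jaccard measure only looks at which distinct words are shared. With real tf-idf vectors stored as matrices, the scikit-learn function named above would do the same normalized-dot-product computation for every pair of rows.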

If you want, you can also solve the cosine similarity for the angle between the vectors:

$\theta = \arccos\left(\frac{\langle X, Y \rangle}{\|X\| \, \|Y\|}\right)$
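Solving for the angle amounts to taking the arccosine of the similarity. The sketch below shows one way to do that for plain Python lists; the function name angle_between and the example vectors are assumptions made for illustration, and the clamp to [-1, 1] is only a guard against floating-point rounding.

```python
import math


def angle_between(x, y):
    """Angle in radians between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("the angle is undefined for a zero vector")
    # Rounding can push the ratio slightly outside [-1, 1], which acos rejects.
    cos_theta = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cos_theta)


# Perpendicular vectors have similarity 0, i.e. an angle of 90 degrees.
print(math.degrees(angle_between([1, 0], [0, 1])))
# A diagonal vector sits roughly 45 degrees from the x-axis.
print(math.degrees(angle_between([1, 0], [1, 1])))
```

Vectors that point the same way give an angle of 0 regardless of their lengths, which is the same orientation-only behaviour the similarity score has.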

In general, there are two ways of finding document-document similarity.