Differences between Jaccard similarity and Cosine Similarity

Jaccard Similarity and Cosine Similarity are two common methods used to measure the similarity between two sets or two vectors. They are used in different scenarios and have different formulas.

  1. Jaccard Similarity: Jaccard Similarity is primarily used for sets or binary data. It compares the presence or absence of elements between two sets. The formula for Jaccard Similarity is:
J(A, B) = |A ∩ B| / |A ∪ B|

Where:

  • |A ∩ B| is the size of the intersection of sets A and B.
  • |A ∪ B| is the size of the union of sets A and B. Jaccard Similarity ranges from 0 (no common elements) to 1 (identical sets).

2. Cosine Similarity: Cosine Similarity is commonly used for vector representations, particularly in text analysis or document comparisons. It calculates the cosine of the angle between two vectors. The formula for Cosine Similarity is:

cos(θ) = (A · B) / (||A|| ||B||)

Where:

  • A · B is the dot product of vectors A and B.
  • ||A|| and ||B|| are the magnitudes (or lengths) of vectors A and B. Cosine Similarity ranges from -1 (opposite directions) to 1 (identical directions), with 0 indicating orthogonality.

Uses:-

   <?php

        if ($_SERVER["REQUEST_METHOD"] == "POST") {
            $user_content = isset($_POST["user_content"]) ? $_POST["user_content"] : '';

            if (!empty($user_content)) {

                // echo "Searching for: $user_content<br>";
                // Your Google API key
                $apiKey = '';

                // Your Custom Search Engine ID
                $cx = '';

                // Function to fetch Google search results using Custom Search JSON API with pagination
                function fetchGoogleResults($query, $apiKey, $cx, $startIndex = 1)
                {
                    // Define the number of results per page
                    $resultsPerPage = 10;

                    // Calculate the start index for the current page
                    $start = ($startIndex - 1) * $resultsPerPage + 1;

                    // Build the API request URL with the start parameter
                    $url = "https://www.googleapis.com/customsearch/v1?key=$apiKey&cx=$cx&q=" . urlencode($query) . "&start=$start";

                    // Fetch the results using file_get_contents
                    $response = file_get_contents($url);

                    return $response;
                }

                // Fetch and extract Google search results from multiple pages
                $matches = [];

                for ($page = 1; $page <= 5; $page++) { // You can adjust the number of pages as needed
                    $google_results_page = fetchGoogleResults($user_content, $apiKey, $cx, $page);
                   
                    $google_results_page_data = json_decode($google_results_page, true);
                   
                    // Check if there are items in the current page
                    if (isset($google_results_page_data['items'])) {
                        foreach ($google_results_page_data['items'] as $item) {
                            
                            // Tokenize the strings
                            $user_tokens = array_count_values(preg_split('/\W+/u', $user_content, -1, PREG_SPLIT_NO_EMPTY));
                            $google_tokens = array_count_values(preg_split('/\W+/u', $item['snippet'], -1, PREG_SPLIT_NO_EMPTY));
                           
                            // Count the matching words
                            $matching_words = array_intersect_key($user_tokens, $google_tokens);
                            $total_matching_words = array_sum($matching_words);
                    
                            // Calculate Jaccard similarity (check for division by zero)
                            $jaccard_percentage = (count($user_tokens) > 0) ? ($total_matching_words / count($user_tokens)) * 100 : 0;
                    
                            // Calculate cosine similarity
                            $dotProduct = 0;
                            $norm1 = 0;
                            $norm2 = 0;
                            foreach ($user_tokens as $word => $count) {
                                $dotProduct += $count * ($google_tokens[$word] ?? 0);
                                $norm1 += $count ** 2;
                            }
                            foreach ($google_tokens as $count) {
                                $norm2 += $count ** 2;
                            }
                    
                            // Check for division by zero before calculating cosine similarity
                            $cosine = ($norm1 * $norm2 > 0) ? $dotProduct / sqrt($norm1 * $norm2) : 0;
                            $cosine_percentage = $cosine * 100;
                    
                            // Add match to array
                            $matches[] = [
                                'title' => $item['title'],
                                'link' => $item['link'],
                                'snippet' => $item['snippet'],
                                'jaccard' => $jaccard_percentage,
                                'cosine' => $cosine_percentage
                            ];
                        }
                    }
                    
                }

                // Calculate overall plagiarism and uniqueness based on the most similar match
                $max_cosine = max(array_column($matches, 'cosine'));

                $plagiarism_percentage = $max_cosine;
                $uniqueness_percentage = 100 - $plagiarism_percentage;

                // Display overall results
                echo "<h2>Overall Plagiarism Check Results</h2>";
                echo "<p>Plagiarism: " . number_format($plagiarism_percentage, 2) . "%</p>";
                echo "<p>Unique: " . number_format($uniqueness_percentage, 2) . "%</p>";

                // Display detailed matches
                // Sort the matches by cosine similarity in descending order
                usort($matches, function ($a, $b) {
                    return $b['cosine'] - $a['cosine'];
                });

                // Display only the top 5 matches
                $top_matches = array_slice($matches, 0, 5);

                // Display overall results
                echo "<h2>Top 5 Matches</h2>";

                if (!empty($top_matches)) {
                    // Display detailed matches
                    foreach ($top_matches as $match) {
                        echo "<div>";
                        echo "<h3>" . $match['title'] . "</h3>";
                        echo "<p>" . $match['snippet'] . "</p>";
                        echo "<p>Jaccard Similarity: " . number_format($match['jaccard'], 2) . "%</p>";
                        echo "<p>Cosine Similarity: " . number_format($match['cosine'], 2) . "%</p>";
                        echo "<a href='" . $match['link'] . "' target='_blank'>Read More</a>";
                        echo "</div>";
                        echo "<hr>";
                    }
                } else {
                    echo "<p>No matches found.</p>";
                }

                // Clear the textarea content
                $_POST["user_content"] = "";
            } else {
                echo "<p>Please provide content for plagiarism check.</p>";
            }
        }

        ?>

Result:-

Related Posts

Modern Data Operations: A Practical DataOps Platform Implementation Guide

Introduction Modern data ecosystems are expanding at an unprecedented rate. Centralized databases have given way to distributed cloud data warehouses, real-time data streaming architectures, and multi-cloud data…

Read More

Data Pipeline Optimization Techniques for Low-Latency Data Analytics

Introduction In a fast-paced digital economy, the shelf life of data value is shorter than ever. Businesses no longer have the luxury of waiting for overnight batch…

Read More

The Best AIOps Training Program Guide For Cloud Engineers

As modern IT environments transition from centralized datacenters to highly distributed, multi-cloud, and microservices-based setups, the sheer volume of data generated by enterprise software has exploded. Infrastructure…

Read More

Connect Directly with Trusted Local Experts Using Professnow Marketplace

The local service market is highly fragmented, making it difficult to verify a provider’s background, past work, or true capabilities before they show up at your door….

Read More

Accelerating Analytics Delivery by Automating Data Validation with DataOps Tools

Introduction In the modern digital economy, high-quality, trusted data serves as the foundation for critical enterprise decisions. Organizations rely heavily on business intelligence, machine learning models, and…

Read More

How Predictive Monitoring Platforms Optimize Modern DataOps and Data Observability

Introduction Traditional monitoring systems are no longer equipped to handle this level of complexity. Legacy tools depend entirely on static thresholds, which flag problems only after a…

Read More
Subscribe
Notify of
guest
0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x