Jacaard Index, Similarity based on Tag System


/ Published in: SQL
Save to your folder(s)

This piece of sql is used to find the similarity of one item to other items based on a 3 table tag system, cities, cities_tags, tags. It works in sQlite and MySQL. Use this code to get the idea of how you would make a "item" has "related items" based on "tags" solution.


Copy this code and paste it in your HTML
  1. BEGIN TRANSACTION;
  2.  
  3. ----
  4. -- Drop table for cities
  5. ----
  6. DROP TABLE "cities";
  7.  
  8. ----
  9. -- Table structure for cities
  10. ----
  11. CREATE TABLE 'cities' ('id' INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 'name' TEXT);
  12.  
  13. ----
  14. -- Data dump for cities, a total of 3 rows
  15. ----
  16. INSERT INTO "cities" ("id","name") VALUES ('1','Paris');
  17. INSERT INTO "cities" ("id","name") VALUES ('2','London');
  18. INSERT INTO "cities" ("id","name") VALUES ('3','New York');
  19.  
  20. ----
  21. -- Drop table for tags
  22. ----
  23. DROP TABLE "tags";
  24.  
  25. ----
  26. -- Table structure for tags
  27. ----
  28. CREATE TABLE 'tags' ('id' INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 'tag' TEXT);
  29.  
  30. ----
  31. -- Data dump for tags, a total of 3 rows
  32. ----
  33. INSERT INTO "tags" ("id","tag") VALUES ('1','Europe');
  34. INSERT INTO "tags" ("id","tag") VALUES ('2','North America');
  35. INSERT INTO "tags" ("id","tag") VALUES ('3','River');
  36.  
  37. ----
  38. -- Drop table for cities_tags
  39. ----
  40. DROP TABLE "cities_tags";
  41.  
  42. ----
  43. -- Table structure for cities_tags
  44. ----
  45. CREATE TABLE 'cities_tags' ('id' INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, 'city_id' INTEGER NOT NULL, 'tag_id' INTEGER NOT NULL);
  46.  
  47. ----
  48. -- Data dump for cities_tags, a total of 5 rows
  49. ----
  50. INSERT INTO "cities_tags" ("id","city_id","tag_id") VALUES ('1','1','1');
  51. INSERT INTO "cities_tags" ("id","city_id","tag_id") VALUES ('2','1','3');
  52. INSERT INTO "cities_tags" ("id","city_id","tag_id") VALUES ('3','2','1');
  53. INSERT INTO "cities_tags" ("id","city_id","tag_id") VALUES ('4','2','3');
  54. INSERT INTO "cities_tags" ("id","city_id","tag_id") VALUES ('5','3','2');
  55. INSERT INTO "cities_tags" ("id","city_id","tag_id") VALUES ('6','3','3');
  56. COMMIT;
  57.  
  58.  
  59. ----
  60. -- Now the fun, when we are looking at city 1 ( Paris ) want to find the
  61. -- more similar city to Paris by looking at the tags.
  62. -- we use a jacqaard index to compare the sets and ratios
  63. -- city_id=1 is Paris and should be change to variable...
  64. -- city_id!=1 is also making sure that we don't include paris in the results...
  65. ----
  66.  
  67.  
  68.  
  69. SELECT c.name,
  70. CASE
  71. WHEN not_in.cnt IS NULL
  72. THEN
  73. INTERSECTION.cnt * 1.0 /(SELECT COUNT(tag_id) FROM cities_tags WHERE city_id=1)
  74. ELSE
  75. INTERSECTION.cnt/(not_in.cnt+(SELECT COUNT(tag_id) FROM cities_tags WHERE city_id=1))
  76. END AS jaccard_index
  77. FROM cities c
  78. INNER JOIN
  79. (
  80. SELECT city_id, COUNT(*) AS cnt
  81. FROM cities_tags
  82. WHERE tag_id IN (SELECT tag_id FROM cities_tags WHERE city_id=1)
  83. AND city_id!=1
  84. GROUP BY city_id
  85. ) AS INTERSECTION
  86. ON c.id=INTERSECTION.city_id
  87. LEFT JOIN
  88. (
  89. SELECT city_id, COUNT(tag_id) AS cnt
  90. FROM cities_tags
  91. WHERE city_id!=1
  92. AND NOT tag_id IN (SELECT tag_id FROM cities_tags WHERE city_id=1)
  93. GROUP BY city_id
  94. ) AS not_in
  95. ON c.id=not_in.city_id
  96. ORDER BY jaccard_index DESC

URL: http://www.itsgotto.be/cv

Report this snippet


Comments

RSS Icon Subscribe to comments

You need to login to post a comment.