Posted By

Twain on 10/28/07


Tagged

md5



Recursive MD5


Published in: Python
 

Recursively calculates, checks and updates "md5sum" files, with renaming detection. Handy for directories of files that aren't supposed to change but may get renamed.
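
As a rough illustration of the renaming detection: a file whose name is new but whose checksum matches a file that has gone missing is reported as renamed rather than as a delete plus an add. The sketch below is a simplified stand-in written for Python 3, not the script's own code; the helper name `classify` and the sample dictionaries are hypothetical.

```python
def classify(old_sums, new_sums):
    """Compare two {filename: md5} maps and return (renamed, added, deleted).

    Hypothetical helper for illustration only; not part of md5dir.
    """
    # Checksums of files that were recorded earlier but are now missing
    missing = {md5: name for name, md5 in old_sums.items()
               if name not in new_sums}
    renamed, added = [], []
    for name, md5 in sorted(new_sums.items()):
        if name in old_sums:
            continue  # Known file; checking it for changes is a separate step
        if md5 in missing:
            # Same content under a new name: treat it as a rename
            renamed.append((missing.pop(md5), name))
        else:
            added.append(name)
    # Whatever is still missing was not matched by any new file
    deleted = sorted(missing.values())
    return renamed, added, deleted


old = {"a.txt": "111", "b.txt": "222"}
new = {"a.txt": "111", "c.txt": "222", "d.txt": "333"}
print(classify(old, new))
```

Because matching is done purely on the checksum, a file that was both renamed and edited cannot be paired up and appears as deleted plus added.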

Also includes some code taken from my mp3md5.py to provide an option to ignore ID3 tags on .mp3 files. The blog post has some details on how Windows users can integrate it into the Explorer context menu (http://grahamweekly.blogspot.com/2007/04/md5dir.html).
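
To show roughly what ignoring ID3 tags means: an ID3v1 tag is a fixed 128-byte block at the end of the file beginning with "TAG", and an ID3v2 tag is a 10-byte header at the start beginning with "ID3", whose body size is stored as a 4-byte synchsafe integer (7 bits per byte). The sketch below works on bytes in memory and assumes Python 3 with hashlib, whereas the script targets Python 2; the names `synchsafe` and `audio_md5` and the sample data are made up for the example.

```python
import hashlib


def synchsafe(b):
    """Decode a 4-byte synchsafe integer (7 significant bits per byte)."""
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]


def audio_md5(data):
    """MD5 of MP3 bytes, excluding an ID3v2 header and an ID3v1 trailer."""
    start, finish = 0, len(data)
    # ID3v1: last 128 bytes, beginning with "TAG"
    if finish >= 128 and data[-128:-125] == b"TAG":
        finish -= 128
    # ID3v2: 10-byte header at the start, beginning with "ID3"
    if len(data) >= 10 and data[:3] == b"ID3":
        start = 10 + synchsafe(data[6:10])
        if data[5] & 0x10:  # Flag bit 4: a 10-byte footer follows the body
            start += 10
    return hashlib.md5(data[start:finish]).hexdigest()


# Hypothetical sample: fake audio wrapped in an ID3v2 and an ID3v1 tag
audio = b"\xff\xfb" + b"\x00" * 200
v2 = b"ID3\x03\x00\x00" + bytes([0, 0, 0, 5]) + b"x" * 5
v1 = b"TAG" + b" " * 125
print(audio_md5(v2 + audio + v1) == audio_md5(audio))
```

The checksum stays the same when tags are edited, which is the point of MP3 mode, but it no longer matches what GNU md5sum prints for the whole file.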

#!/usr/bin/python

"""md5dir -- Recursive MD5 checksums for files which move around

Usage: md5dir [options] [directories]

Without options it writes an 'md5sum' file in each subdirectory
containing MD5 checksums for that directory's files.

During this it outputs progress dots, and then prints out the names of
files which have been added, deleted, renamed or changed since the
last run, and a summary of the number of files in each category.

A file which has been both changed and renamed since the last run
shows up as DELETED followed by ADDED.

The md5sum files are read and written in the format of the GNU md5sum
utility (http://www.gnu.org/software/textutils/textutils.html).

-3/--mp3
    Enable MP3 mode: for files ending in .mp3, calculate a checksum
    which skips ID3v1 and ID3v2 tags. This checksum differs from the
    normal one which is compatible with GNU md5sum. The md5sum file is
    tagged so that md5dir will in future always use MP3 mode for the
    directory. Consider using mp3md5.py instead, which keeps this
    tag-skipping checksum in the ID3v2 tag as a Unique File ID.

-c/--confirm
    Additionally output CONFIRMED lines for unchanged files.

-f X/--file=X
    Use X as the name of the MD5 file instead of the default of md5sum
    (edit source to change the default).

-h/--help
    Output this message then exit.

-l/--license
    Output the license terms for md5dir then exit.

-m/--master
    Enable master mode which creates a 'master' md5sum file for the
    entire hierarchy under each argument (instead of each subdir having
    its own md5sum). Note that per-directory md5sum files are removed
    in the process. The md5sum file is tagged so md5dir will in future
    always use master mode for the directory.

-n/--nocheck
    Only look for RENAMED/ADDED/DELETED files. Generally fast for
    subsequent runs since it does not check for changes to existing
    checksums (and ignores any --confirm/--update options).

-q/--quiet
    Do not produce any output (just update the md5sum files).

-r/--remove
    Ignore other options and remove any md5sum files found under the
    arguments (outputs REMOVING lines).

-u/--update
    Output UPDATED lines and update checksums for altered files (instead
    of outputting CHANGED lines). After updating, such files should be
    CONFIRMED on subsequent runs.

Copyright 2007 Graham Poulter
"""

__copyright__ = "2007 Graham Poulter"
__author__ = "Graham Poulter"
__license__ = """This program is free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation, either version 3 of the License, or (at your option)
any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program. If not, see <http://www.gnu.org/licenses/>."""

from getopt import getopt
import md5
import os
import os.path as op
import re
import struct
import sys

hashfile = "md5sum"  # Default name for checksum file
check = True  # Whether to check for changes
confirm = False  # Whether to output CONFIRMED lines
master = False  # Whether to use master mode vs per-directory checksums
quiet = False  # Whether to suppress all output
remove = False  # Whether to work in 'REMOVING md5sum' mode
update = False  # Whether to update changed checksums
mp3mode = False  # Whether to use tag-skipping checksum for MP3s
changelog = "md5changes.txt"  # Names of changed files

# Regular expression for lines in GNU md5sum file
md5line = re.compile(r"^([0-9a-f]{32}) [\ \*](.*)$")

### WARNING: ORIGINAL FUNCTION IS IN MP3MD5.PY - MODIFY THERE
def calculateUID(filepath):
    """Calculate MD5 for an MP3 excluding ID3v1 and ID3v2 tags if
    present. See www.id3.org for tag format specifications."""
    f = open(filepath, "rb")
    # Detect ID3v1 tag if present (only possible if file has 128+ bytes)
    finish = os.stat(filepath).st_size
    if finish >= 128:
        f.seek(-128, 2)
        if f.read(3) == "TAG":
            finish -= 128
    # ID3 at the start marks ID3v2 tag (0-2)
    f.seek(0)
    start = f.tell()
    if f.read(3) == "ID3":
        # Bytes w major/minor version (3-4)
        version = f.read(2)
        # Flags byte (5)
        flags = struct.unpack("B", f.read(1))[0]
        # Flag bit 4 means footer is present (10 bytes)
        footer = flags & (1<<4)
        # Size of tag body synchsafe integer (6-9)
        bs = struct.unpack("BBBB", f.read(4))
        bodysize = (bs[0]<<21) + (bs[1]<<14) + (bs[2]<<7) + bs[3]
        # Seek to end of ID3v2 tag
        f.seek(bodysize, 1)
        if footer:
            f.seek(10, 1)
        # Start of rest of the file
        start = f.tell()
    # Calculate MD5 using stuff between tags
    f.seek(start)
    h = md5.new()
    h.update(f.read(finish-start))
    f.close()
    return h.hexdigest()

def readsums(filepath):
    """Yield (md5, filename) pairs from a checksum file

    @param filepath: Name of file containing checksums
    """
    if not op.isfile(filepath):
        return
    for line in open(filepath, "r").readlines():
        match = md5line.match(line.rstrip("\n"))
        # Skip non-md5sum lines
        if not match:
            continue
        yield match.group(1), match.group(2)

def writesums(filepath, checksums, master, mp3mode):
    """Given a list of (filename,md5) in checksums, write them to
    filepath in md5sum format sorted by filename, with a #md5dir
    header"""
    f = open(filepath, "w")
    f.write("#md5dir")
    if master:
        f.write(" master")
    if mp3mode:
        f.write(" mp3mode")
    f.write("\n")
    for fname, md5 in sorted(checksums, key=lambda x:x[0]):
        # Two characters separate checksum and name, as md5line expects
        f.write("%s  %s\n" % (md5, fname))
    f.close()

def hashflags(dirpath):
    """If the directory holds a hashfile starting with #md5dir, return
    a list of the remaining words on that line (should be 'master' and
    'mp3mode' for now)"""
    hpath = op.join(dirpath, hashfile)
    if not op.isfile(hpath):
        return []
    f = open(hpath, "r")
    s = f.readline().split()
    f.close()
    # An empty first line yields no words, so check before indexing
    if not s or s[0] != "#md5dir":
        return []
    else:
        return s[1:]

def calcsum(filepath, mp3mode):
    """Return md5 checksum for a file. Uses the tag-skipping algorithm
    for .mp3 files if in mp3mode."""
    if mp3mode and filepath.endswith(".mp3"):
        return calculateUID(filepath)
    h = md5.new()
    f = open(filepath, "rb")
    s = f.read(1048576)
    while s != "":
        h.update(s)
        s = f.read(1048576)
    f.close()
    # Output "." as a progress meter
    if not quiet:
        sys.stdout.write(".")
        sys.stdout.flush()
    return h.hexdigest()

def log(msg, filename):
    """Output a log message"""
    if not quiet:
        print "%-10s%s" % (msg, filename)

def md5dir(root, filenames, master):
    """Write an md5sum file in root for the list of filenames
    (specified relative to root).
    """
    # Decide whether to use mp3mode
    use_mp3mode = mp3mode
    if "mp3mode" in hashflags(root):
        use_mp3mode = True

    # Change directory
    oldcwd = os.getcwd()
    os.chdir(root)
    filenames.sort()

    # present is used to detect case changed files on Windows
    checksums = {}  # Map fname->md5
    present = {}  # Map md5->fname for present files
    deleted = {}  # Map md5->fname for deleted files

    changed = []  # Changed files
    added = []  # Added files
    confirmed = []  # Confirmed files
    renamed = []  # Renamed files as (old,new) pairs

    # Read checksums from hashfile
    for md5, fname in readsums(hashfile):
        if op.isfile(fname):
            checksums[fname] = md5
            present[md5] = fname
        else:
            deleted[md5] = fname

    # Read files from directory
    newhash = None
    for fname in filenames:
        if fname == hashfile:
            continue
        if fname not in checksums:
            newhash = calcsum(fname, use_mp3mode)
            checksums[fname] = newhash
            if newhash in deleted:
                renamed.append((deleted[newhash], fname))
                del deleted[newhash]
            elif newhash in present:
                # Identical files with case-differing names implies
                # a renaming on a case-insensitive filesystem
                oldname = present[newhash]
                if oldname.lower() == fname.lower():
                    renamed.append((oldname, fname))
                    del checksums[oldname]
                    checksums[fname] = newhash
                else:
                    # A copy of an existing file under a different name
                    # is still a new file, so report and record it
                    added.append(fname)
            else:
                added.append(fname)
        elif check:
            newhash = calcsum(fname, use_mp3mode)
            if checksums[fname] == newhash:
                confirmed.append(fname)
            else:
                changed.append(fname)
                if update:
                    checksums[fname] = newhash
    # End the line of progress dots
    if newhash and not quiet:
        sys.stdout.write("\n")

    # Log all changes
    if confirm:
        for fname in confirmed:
            log("CONFIRMED", op.join(root,fname))
    for old, new in renamed:
        log("RENAMED", "%s: %s --> %s" % (root,old,new))
    for fname in added:
        log("ADDED", op.join(root,fname))
    for fname in sorted(deleted.itervalues()):
        log("DELETED", op.join(root,fname))
    for fname in changed:
        if update:
            log("UPDATED", op.join(root,fname))
        else:
            log("CHANGED", op.join(root,fname))
    log("LOCATION", root)
    log("STATUS", "confirmed %d renamed %d added %d deleted %d changed %d" % (
        len(confirmed), len(renamed), len(added), len(deleted), len(changed)))
    if not quiet:
        sys.stdout.write("\n")

    # Write list of changed files, removed on update
    if changed and not update:
        logfile = open(changelog, "a")
        try:
            for fname in changed:
                logfile.write(op.join(root, fname)+"\n")
        finally:
            logfile.close()
    if update and op.isfile(changelog):
        # os.path has no remove(); the call belongs to os
        os.remove(changelog)

    # Write hashfile if necessary
    if renamed or added or deleted or changed:
        if checksums:
            try:
                writesums(hashfile, checksums.iteritems(), master, use_mp3mode)
            except IOError, e:
                log("WARNING", "Error writing to %s" % op.join(root, hashfile))
        elif op.isfile(hashfile):
            os.remove(hashfile)

    os.chdir(oldcwd)

def master_list(start):
    """Return a list of files relative to start directory, and remove
    all hashfiles except the one directly under start. """
    flist = []
    oldcwd = os.getcwd()
    os.chdir(start)
    # Collect all files under start
    for root, dirs, files in os.walk("."):
        for fname in files:
            # Only keep the topmost hash file
            if fname == hashfile and root != ".":
                log("REMOVING", op.join(root,fname))
                os.remove(op.join(root,fname))
            else:
                flist.append(op.join(root[2:],fname))
    os.chdir(oldcwd)
    return flist

if __name__ == "__main__":
    # Parse command-line options
    optlist, args = getopt(
        sys.argv[1:], "3cf:hlmnqru",
        ["mp3","confirm", "file=", "help", "license", "master", "nocheck", "quiet", "remove", "update"])
    for opt, value in optlist:
        if opt in ["-3", "--mp3"]:
            mp3mode = True
        elif opt in ["-c", "--confirm"]:
            confirm = True
        elif opt in ["-f", "--file"]:
            hashfile = value
        elif opt in ["-h","--help"]:
            print __doc__
            sys.exit(0)
        elif opt in ["-l", "--license"]:
            # The terms live in __license__; bare 'license' is a Python builtin
            print __license__
            sys.exit(0)
        elif opt in ["-m", "--master"]:
            master = True
        elif opt in ["-n", "--nocheck"]:
            check = False
        elif opt in ["-q", "--quiet"]:
            quiet = True
        elif opt in ["-r", "--remove"]:
            remove = True
        elif opt in ["-u", "--update"]:
            update = True
    if len(args) == 0:
        log("WARNING", "Exiting because no directories given (use -h for help)")
        sys.exit(0)

    # Remove old changelog
    if op.exists(changelog):
        os.remove(changelog)

    # Treat each argument separately
    for start in args:
        if not op.isdir(start):
            log("WARNING", "Argument %s is not a directory" % start)
            continue

        # Remove checksum files
        if remove:
            for root, dirs, files in os.walk(start):
                dirs.sort()
                files.sort()
                for fname in files:
                    if fname == hashfile:
                        log("REMOVING", op.join(root,fname))
                        os.remove(op.join(root,fname))

        # Master checksum
        elif master:
            md5dir(start, master_list(start), master=True)

        # Per-directory checksum
        else:
            for root, dirs, files in os.walk(start):
                dirs.sort()
                files.sort()
                if "master" in hashflags(root):
                    del dirs[:]
                    md5dir(root, master_list(root), master=True)
                else:
                    md5dir(root, files, master=False)
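
For reference, here is a small sketch of how a checksum file in this format can be parsed. Each entry is 32 lowercase hex digits, a space, a marker character (a space for text mode or an asterisk for binary mode in GNU md5sum), then the file name; the `#md5dir` header line does not match the pattern and is skipped. The sketch assumes Python 3 and uses an expression equivalent to the script's md5line; the function name `parse_sums` and the sample text are hypothetical.

```python
import re

# Equivalent to the md5line expression used by the script
MD5LINE = re.compile(r"^([0-9a-f]{32}) [ *](.*)$")


def parse_sums(text):
    """Return (md5, filename) pairs for every md5sum-style line in text."""
    pairs = []
    for line in text.splitlines():
        match = MD5LINE.match(line)
        if match:  # Header and other non-checksum lines are ignored
            pairs.append((match.group(1), match.group(2)))
    return pairs


# Hypothetical file contents; the digest is the MD5 of empty input
sample = (
    "#md5dir master\n"
    "d41d8cd98f00b204e9800998ecf8427e  empty.txt\n"
    "d41d8cd98f00b204e9800998ecf8427e *bin.dat\n"
)
for md5, name in parse_sums(sample):
    print(md5, name)
```

Keeping the header as a line that the pattern rejects means the tags ride along in the checksum file without disturbing the entries that follow it.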
