Posted By

dominicsayers on 01/28/09


Tagged

rfc



RFC-compliant email address validator


Published in: PHP 


URL: http://www.dominicsayers.com/isemail/

A PHP function that correctly validates all parts of a given email address, according to RFCs 5322, 5321, 1123, 2396, 3696, 4291, 4343, 2821 & 2822. I’ve released it under a license that allows you to use it royalty-free in commercial or non-commercial work.

The test cases and the latest version of the code will always be here: http://code.google.com/p/isemail/source/browse/#svn/trunk
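A handful of addresses can be checked with a small table-driven loop. The sketch below is not the author's test suite: `run_cases` and the stand-in validator are hypothetical names introduced here so the example runs on its own. Once the function from the listing is loaded, passing `'is_email'` as the validator should work the same way.

```php
<?php
// Hypothetical helper (not part of the original snippet): run a validator
// over [address, expected] pairs and return the addresses whose result
// differs from the expectation.
function run_cases(callable $validator, array $cases): array {
	$failures = array();
	foreach ($cases as $case) {
		list($address, $expected) = $case;
		if ($validator($address) !== $expected) {
			$failures[] = $address;
		}
	}
	return $failures;
}

// Stand-in validator so the sketch is self-contained. It only checks that
// the last at-sign has something on both sides; swap in 'is_email' once
// the real function is available.
$standIn = function ($address) {
	$at = strrpos($address, '@');
	return $at !== false && $at > 0 && $at < strlen($address) - 1;
};

$cases = array(
	array('first.last@example.com', true),
	array('no-at-sign.example.com', false),
	array('@example.com', false),
	array('first.last@', false),
);

$failures = run_cases($standIn, $cases);
echo count($failures) === 0 ? "all cases passed\n" : "failed: " . implode(', ', $failures) . "\n";
```

Returning the list of failing addresses, rather than a single boolean, makes it easier to see which case disagrees when the table grows.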

<?php
/*
Copyright 2009 Dominic Sayers
dominic_sayers@hotmail.com
http://www.dominicsayers.com

Version 1.7

This source file is subject to the Common Public Attribution License Version 1.0 (CPAL) license.
The license terms are available through the world-wide-web at http://www.opensource.org/licenses/cpal_1.0
*/

// PHPLint modules
/*.
require_module 'standard';
require_module 'pcre';
.*/
/*.boolean.*/ function is_email (/*.string.*/ $email, $checkDNS = false) {
	// Check that $email is a valid address. Read the following RFCs to understand the constraints:
	// (http://tools.ietf.org/html/rfc5322)
	// (http://tools.ietf.org/html/rfc3696)
	// (http://tools.ietf.org/html/rfc5321)
	// (http://tools.ietf.org/html/rfc4291#section-2.2)
	// (http://tools.ietf.org/html/rfc1123#section-2.1)

	// The upper limit on address lengths should normally be considered to be 256
	// (http://www.rfc-editor.org/errata_search.php?rfc=3696)
	// NB I think John Klensin is misreading RFC 5321 and the limit should actually be 254
	// However, I will stick to the published number until it is changed.
	//
	// The maximum total length of a reverse-path or forward-path is 256
	// characters (including the punctuation and element separators)
	// (http://tools.ietf.org/html/rfc5321#section-4.5.3.1.3)
	$emailLength = strlen($email);
	if ($emailLength > 256) return false; // Too long

	// Contemporary email addresses consist of a "local part" separated from
	// a "domain part" (a fully-qualified domain name) by an at-sign ("@").
	// (http://tools.ietf.org/html/rfc3696#section-3)
	$atIndex = strrpos($email,'@');

	if ($atIndex === false) return false; // No at-sign
	if ($atIndex === 0) return false; // No local part
	if ($atIndex === $emailLength - 1) return false; // No domain part (the at-sign is the last character)

	// Sanitize comments
	// - remove nested comments, quotes and dots in comments
	// - remove parentheses and dots from quoted strings
	$braceDepth = 0;
	$inQuote = false;
	$escapeThisChar = false;

	for ($i = 0; $i < $emailLength; ++$i) {
		$char = $email[$i];
		$replaceChar = false;

		if ($char === '\\') {
			$escapeThisChar = !$escapeThisChar; // Escape the next character?
		} else {
			switch ($char) {
				case '(':
					if ($escapeThisChar) {
						$replaceChar = true;
					} else {
						if ($inQuote) {
							$replaceChar = true;
						} else {
							if ($braceDepth++ > 0) $replaceChar = true; // Increment brace depth
						}
					}

					break;
				case ')':
					if ($escapeThisChar) {
						$replaceChar = true;
					} else {
						if ($inQuote) {
							$replaceChar = true;
						} else {
							if (--$braceDepth > 0) $replaceChar = true; // Decrement brace depth
							if ($braceDepth < 0) $braceDepth = 0;
						}
					}

					break;
				case '"':
					if ($escapeThisChar) {
						$replaceChar = true;
					} else {
						if ($braceDepth === 0) {
							$inQuote = !$inQuote; // Are we inside a quoted string?
						} else {
							$replaceChar = true;
						}
					}

					break;
				case '.': // Dots don't help us either
					if ($escapeThisChar) {
						$replaceChar = true;
					} else {
						if ($braceDepth > 0) $replaceChar = true;
					}

					break;
			}

			$escapeThisChar = false;
			if ($replaceChar) $email[$i] = 'x'; // Replace the offending character with something harmless
		}
	}

	$localPart = substr($email, 0, $atIndex);
	$domain = substr($email, $atIndex + 1);
	$FWS = "(?:(?:(?:[ \\t]*(?:\\r\\n))?[ \\t]+)|(?:[ \\t]+(?:(?:\\r\\n)[ \\t]+)*))"; // Folding white space
	// Let's check the local part for RFC compliance...
	//
	// local-part = dot-atom / quoted-string / obs-local-part
	// obs-local-part = word *("." word)
	// (http://tools.ietf.org/html/rfc5322#section-3.4.1)
	//
	// Problem: need to distinguish between "first.last" and "first"."last"
	// (i.e. one element or two). And I suck at regexes.
	$dotArray = /*. (array[int]string) .*/ preg_split('/\\.(?=(?:[^\\"]*\\"[^\\"]*\\")*(?![^\\"]*\\"))/m', $localPart);
	$partLength = 0;

	foreach ($dotArray as $element) {
		// Remove any leading or trailing FWS
		$element = preg_replace("/^$FWS|$FWS\$/", '', $element);

		// Then we need to remove all valid comments (i.e. those at the start or end of the element)
		$elementLength = strlen($element);

		// The length guard avoids reading an offset of an empty string
		if ($elementLength > 0 && $element[0] === '(') {
			$indexBrace = strpos($element, ')');
			if ($indexBrace !== false) {
				if (preg_match('/(?<!\\\\)[\\(\\)]/', substr($element, 1, $indexBrace - 1)) > 0) {
					return false; // Illegal characters in comment
				}
				$element = substr($element, $indexBrace + 1, $elementLength - $indexBrace - 1);
				$elementLength = strlen($element);
			}
		}

		if ($elementLength > 0 && $element[$elementLength - 1] === ')') {
			$indexBrace = strrpos($element, '(');
			if ($indexBrace !== false) {
				if (preg_match('/(?<!\\\\)(?:[\\(\\)])/', substr($element, $indexBrace + 1, $elementLength - $indexBrace - 2)) > 0) {
					return false; // Illegal characters in comment
				}
				$element = substr($element, 0, $indexBrace);
				$elementLength = strlen($element);
			}
		}

		// Remove any leading or trailing FWS around the element (inside any comments)
		$element = preg_replace("/^$FWS|$FWS\$/", '', $element);

		// What's left counts towards the maximum length for this part
		if ($partLength > 0) $partLength++; // for the dot
		$partLength += strlen($element);

		// Each dot-delimited component can be an atom or a quoted string
		// (because of the obs-local-part provision)
		if (preg_match('/^"(?:.)*"$/s', $element) > 0) {
			// Quoted-string tests:
			//
			// Remove any FWS
			$element = preg_replace("/(?<!\\\\)$FWS/", '', $element);
			// My regex skillz aren't up to distinguishing between \" \\" \\\" \\\\" etc.
			// So remove all \\ from the string first...
			$element = preg_replace('/\\\\\\\\/', ' ', $element);
			if (preg_match('/(?<!\\\\|^)["\\r\\n\\x00](?!$)|\\\\"$|""/', $element) > 0) return false; // ", CR, LF and NUL must be escaped, "" is too short
		} else {
			// Unquoted string tests:
			//
			// Period (".") may...appear, but may not be used to start or end the
			// local part, nor may two or more consecutive periods appear.
			// (http://tools.ietf.org/html/rfc3696#section-3)
			//
			// A zero-length element implies a period at the beginning or end of the
			// local part, or two periods together. Either way it's not allowed.
			if ($element === '') return false; // Dots in wrong place

			// Any ASCII graphic (printing) character other than the
			// at-sign ("@"), backslash, double quote, comma, or square brackets may
			// appear without quoting. If any of that list of excluded characters
			// are to appear, they must be quoted
			// (http://tools.ietf.org/html/rfc3696#section-3)
			//
			// Any excluded characters? i.e. 0x00-0x20, (, ), <, >, [, ], :, ;, @, \, comma, period, "
			if (preg_match('/[\\x00-\\x20\\(\\)<>\\[\\]:;@\\\\,\\."]/', $element) > 0) return false; // These characters must be in a quoted string
		}
	}

	if ($partLength > 64) return false; // Local part must be 64 characters or less

	// Now let's check the domain part...

	// The domain name can also be replaced by an IP address in square brackets
	// (http://tools.ietf.org/html/rfc3696#section-3)
	// (http://tools.ietf.org/html/rfc5321#section-4.1.3)
	// (http://tools.ietf.org/html/rfc4291#section-2.2)
	if (preg_match('/^\\[(.)+]$/', $domain) === 1) {
		// It's an address-literal
		$addressLiteral = substr($domain, 1, strlen($domain) - 2);
		$matchesIP = array();

		// Extract IPv4 part from the end of the address-literal (if there is one)
		if (preg_match('/\\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/', $addressLiteral, $matchesIP) > 0) {
			$index = strrpos($addressLiteral, $matchesIP[0]);

			if ($index === 0) {
				// Nothing there except a valid IPv4 address, so...
				return true;
			} else {
				// Assume it's an attempt at a mixed address (IPv6 + IPv4)
				if ($addressLiteral[$index - 1] !== ':') return false; // Character preceding IPv4 address must be ':'
				if (substr($addressLiteral, 0, 5) !== 'IPv6:') return false; // RFC5321 section 4.1.3

				$IPv6 = substr($addressLiteral, 5, ($index === 7) ? 2 : $index - 6);
				$groupMax = 6;
			}
		} else {
			// It must be an attempt at pure IPv6
			if (substr($addressLiteral, 0, 5) !== 'IPv6:') return false; // RFC5321 section 4.1.3
			$IPv6 = substr($addressLiteral, 5);
			$groupMax = 8;
		}

		$groupCount = preg_match_all('/^[0-9a-fA-F]{0,4}|\\:[0-9a-fA-F]{0,4}|(.)/', $IPv6, $matchesIP);
		$index = strpos($IPv6,'::');

		if ($index === false) {
			// We need exactly the right number of groups
			if ($groupCount !== $groupMax) return false; // RFC5321 section 4.1.3
		} else {
			if ($index !== strrpos($IPv6,'::')) return false; // More than one '::'
			$groupMax = ($index === 0 || $index === (strlen($IPv6) - 2)) ? $groupMax : $groupMax - 1;
			if ($groupCount > $groupMax) return false; // Too many IPv6 groups in address
		}

		// Check for unmatched characters
		array_multisort($matchesIP[1], SORT_DESC);
		if ($matchesIP[1][0] !== '') return false; // Illegal characters in address

		// It's a valid IPv6 address, so...
		return true;
	} else {
		// It's a domain name...

		// The syntax of a legal Internet host name was specified in RFC-952
		// One aspect of host name syntax is hereby changed: the
		// restriction on the first character is relaxed to allow either a
		// letter or a digit.
		// (http://tools.ietf.org/html/rfc1123#section-2.1)
		//
		// NB RFC 1123 updates RFC 1035, but this is not currently apparent from reading RFC 1035.
		//
		// Most common applications, including email and the Web, will generally not
		// permit...escaped strings
		// (http://tools.ietf.org/html/rfc3696#section-2)
		//
		// the better strategy has now become to make the "at least one period" test,
		// to verify LDH conformance (including verification that the apparent TLD name
		// is not all-numeric)
		// (http://tools.ietf.org/html/rfc3696#section-2)
		//
		// Characters outside the set of alphabetic characters, digits, and hyphen MUST NOT appear in domain name
		// labels for SMTP clients or servers
		// (http://tools.ietf.org/html/rfc5321#section-4.1.2)
		//
		// RFC5321 precludes the use of a trailing dot in a domain name for SMTP purposes
		// (http://tools.ietf.org/html/rfc5321#section-4.1.2)
		$dotArray = /*. (array[int]string) .*/ preg_split('/\\.(?=(?:[^\\"]*\\"[^\\"]*\\")*(?![^\\"]*\\"))/m', $domain);
		$partLength = 0;

		if (count($dotArray) === 1) return false; // Mail host can't be a TLD

		foreach ($dotArray as $element) {
			// Remove any leading or trailing FWS
			$element = preg_replace("/^$FWS|$FWS\$/", '', $element);

			// Then we need to remove all valid comments (i.e. those at the start or end of the element)
			$elementLength = strlen($element);

			// The length guard avoids reading an offset of an empty string
			if ($elementLength > 0 && $element[0] === '(') {
				$indexBrace = strpos($element, ')');
				if ($indexBrace !== false) {
					if (preg_match('/(?<!\\\\)[\\(\\)]/', substr($element, 1, $indexBrace - 1)) > 0) {
						return false; // Illegal characters in comment
					}
					$element = substr($element, $indexBrace + 1, $elementLength - $indexBrace - 1);
					$elementLength = strlen($element);
				}
			}

			if ($elementLength > 0 && $element[$elementLength - 1] === ')') {
				$indexBrace = strrpos($element, '(');
				if ($indexBrace !== false) {
					if (preg_match('/(?<!\\\\)(?:[\\(\\)])/', substr($element, $indexBrace + 1, $elementLength - $indexBrace - 2)) > 0) {
						return false; // Illegal characters in comment
					}
					$element = substr($element, 0, $indexBrace);
					$elementLength = strlen($element);
				}
			}

			// Remove any leading or trailing FWS around the element (inside any comments)
			$element = preg_replace("/^$FWS|$FWS\$/", '', $element);

			// What's left counts towards the maximum length for this part
			if ($partLength > 0) $partLength++; // for the dot
			$partLength += strlen($element);

			// The DNS defines domain name syntax very generally -- a
			// string of labels each containing up to 63 8-bit octets,
			// separated by dots, and with a maximum total of 255
			// octets.
			// (http://tools.ietf.org/html/rfc1123#section-6.1.3.5)
			if ($elementLength > 63) return false; // Label must be 63 characters or less

			// Each dot-delimited component must be atext
			// A zero-length element implies a period at the beginning or end of the
			// domain, or two periods together. Either way it's not allowed.
			if ($elementLength === 0) return false; // Dots in wrong place

			// Any ASCII graphic (printing) character other than the
			// at-sign ("@"), backslash, double quote, comma, or square brackets may
			// appear without quoting. If any of that list of excluded characters
			// are to appear, they must be quoted
			// (http://tools.ietf.org/html/rfc3696#section-3)
			//
			// If the hyphen is used, it is not permitted to appear at
			// either the beginning or end of a label.
			// (http://tools.ietf.org/html/rfc3696#section-2)
			//
			// Any excluded characters? i.e. 0x00-0x20, (, ), <, >, [, ], :, ;, @, \, comma, period, "
			if (preg_match('/[\\x00-\\x20\\(\\)<>\\[\\]:;@\\\\,\\."]|^-|-$/', $element) > 0) {
				return false;
			}
		}

		if ($partLength > 255) return false; // Domain must be 255 characters or less

		if (preg_match('/^[0-9]+$/', $element) > 0) return false; // TLD can't be all-numeric

		// Check DNS?
		if ($checkDNS && function_exists('checkdnsrr')) {
			if (!(checkdnsrr($domain, 'A') || checkdnsrr($domain, 'MX'))) {
				return false; // Domain doesn't actually exist
			}
		}
	}

	// Eliminate all other factors, and the one which remains must be the truth.
	// (Sherlock Holmes, The Sign of Four)
	return true;
}
?>
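
The dot-splitting step in the listing is the least obvious part, so here is a small standalone sketch of what that `preg_split` pattern does. It reuses the pattern from the listing unchanged; the sample strings and variable names are illustrative only. The lookahead accepts a dot as a separator only when an even number of double quotes follows it, which means the dot sits outside any quoted string. This relies on escaped quotes having already been replaced, as the sanitizing loop in the listing does.

```php
<?php
// Same pattern as the listing: split on a dot only when an even number of
// double quotes follows it, i.e. when the dot is outside a quoted string.
$pattern = '/\\.(?=(?:[^\\"]*\\"[^\\"]*\\")*(?![^\\"]*\\"))/m';

// Illustrative local parts (not taken from the author's test cases)
$samples = array(
	'first.last',     // two atoms
	'"first.last"',   // one quoted string that contains a dot
	'"first"."last"', // two quoted strings separated by a dot
);

foreach ($samples as $localPart) {
	$elements = preg_split($pattern, $localPart);
	echo $localPart, ' => ', count($elements), " element(s)\n";
}
```

The first and third samples split into two elements, while the quoted `"first.last"` stays as one, which is the distinction the comment in the listing asks for.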
