// Copyright 2012 The Chromium Authors
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

// NB: Modelled after Mozilla's code (originally written by Pamela Greene,
// later modified by others), but almost entirely rewritten for Chrome.
//   (netwerk/dns/src/nsEffectiveTLDService.h)
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 *
 * Software distributed under the License is distributed on an "AS IS" basis,
 * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
 * for the specific language governing rights and limitations under the
 * License.
 *
 * The Original Code is Mozilla TLD Service
 *
 * The Initial Developer of the Original Code is
 * Google Inc.
 * Portions created by the Initial Developer are Copyright (C) 2006
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 *   Pamela Greene <[email protected]> (original author)
 *
 * Alternatively, the contents of this file may be used under the terms of
 * either the GNU General Public License Version 2 or later (the "GPL"), or
 * the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
 * in which case the provisions of the GPL or the LGPL are applicable instead
 * of those above. If you wish to allow use of your version of this file only
 * under the terms of either the GPL or the LGPL, and not to allow others to
 * use your version of this file under the terms of the MPL, indicate your
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 *
 * ***** END LICENSE BLOCK ***** */

/*
  (Documentation based on the Mozilla documentation currently at
  http://wiki.mozilla.org/Gecko:Effective_TLD_Service, written by the same
  author.)

  The RegistryControlledDomainService examines the hostname of a GURL passed to
  it and determines the longest portion that is controlled by a registrar.
  Although technically the top-level domain (TLD) for a hostname is the last
  dot-portion of the name (such as .com or .org), many domains (such as co.uk)
  function as though they were TLDs, allocating any number of more specific,
  essentially unrelated names beneath them.  For example, .uk is a TLD, but
  nobody is allowed to register a domain directly under .uk; the "effective"
  TLDs are ac.uk, co.uk, and so on.  We wouldn't want to allow any site in
  *.co.uk to set a cookie for the entire co.uk domain, so it's important to be
  able to identify which higher-level domains function as effective TLDs and
  which can be registered.

  The service obtains its information about effective TLDs from a text resource
  that must be in the following format:

  * It should use plain ASCII.
  * It should contain one domain rule per line, terminated with \n, with nothing
    else on the line.  (The last rule in the file may omit the ending \n.)
  * Rules should have been normalized using the same canonicalization that GURL
    applies.  For ASCII, that means they're not case-sensitive, among other
    things; other normalizations are applied for other characters.
  * Each rule should list the entire TLD-like domain name, with any subdomain
    portions separated by dots (.) as usual.
  * Rules should neither begin nor end with a dot.
  * If a hostname matches more than one rule, the most specific rule (that is,
    the one with more dot-levels) will be used.
  * Other than in the case of wildcards (see below), rules do not implicitly
    include their subcomponents.  For example, "bar.baz.uk" does not imply
    "baz.uk", and if "bar.baz.uk" is the only rule in the list, "foo.bar.baz.uk"
    will match, but "baz.uk" and "qux.baz.uk" won't.
  * The wildcard character '*' will match any valid sequence of characters.
  * Wildcards may only appear as the entire most specific level of a rule.  That
    is, a wildcard must come at the beginning of a line and must be followed by
    a dot.  (You may not use a wildcard as the entire rule.)
  * A wildcard rule implies a rule for the entire non-wildcard portion.  For
    example, the rule "*.foo.bar" implies the rule "foo.bar" (but not the rule
    "bar").  This is typically important in the case of exceptions (see below).
  * The exception character '!' before a rule marks an exception to a wildcard
    rule.  If your rules are "*.tokyo.jp" and "!pref.tokyo.jp", then
    "a.b.tokyo.jp" has an effective TLD of "b.tokyo.jp", but "a.pref.tokyo.jp"
    has an effective TLD of "tokyo.jp" (the exception prevents the wildcard
    match, and we thus fall through to matching on the implied "tokyo.jp" rule
    from the wildcard).
  * If you use an exception rule without a corresponding wildcard rule, the
    behavior is undefined.

  Firefox has a very similar service, and it's their data file we use to
  construct our resource.  However, the data expected by this implementation
  differs from the Mozilla file in several important ways:
  (1) We require that all single-level TLDs (com, edu, etc.) be explicitly
      listed.  As of this writing, Mozilla's file includes the single-level
      TLDs too, but that might change.
  (2) Our data is expected to be in pure ASCII: all UTF-8 or otherwise encoded
      items must already have been normalized.
  (3) We do not allow comments, rule notes, blank lines, or line endings other
      than LF.
  Rules are also expected to be syntactically valid.

  The utility application tld_cleanup.exe converts a Mozilla-style file into a
  Chrome one, making sure that single-level TLDs are explicitly listed, using
  GURL to normalize rules, and validating the rules.
*/

#ifndef NET_BASE_REGISTRY_CONTROLLED_DOMAINS_REGISTRY_CONTROLLED_DOMAIN_H_
#define NET_BASE_REGISTRY_CONTROLLED_DOMAINS_REGISTRY_CONTROLLED_DOMAIN_H_

#include <stddef.h>

#include <optional>
#include <string>
#include <string_view>

#include "net/base/net_export.h"

class GURL;

namespace url {
class Origin;
}

struct DomainRule;

namespace net::registry_controlled_domains {

// This enum is a required parameter to all public methods declared for this
// service. The Public Suffix List (http://publicsuffix.org/) this service
// uses as a data source splits all effective-TLDs into two groups. The main
// group describes registries that are acknowledged by ICANN. The second group
// contains a list of private additions for domains that enable external users
// to create subdomains, such as appspot.com.
// The PrivateRegistryFilter enum lets you choose whether you want to include
// the private additions in your lookup.
// See this for example use cases:
// https://wiki.mozilla.org/Public_Suffix_List/Use_Cases
enum PrivateRegistryFilter {
  EXCLUDE_PRIVATE_REGISTRIES = 0,
  INCLUDE_PRIVATE_REGISTRIES
};

// This enum is a required parameter to the GetRegistryLength functions
// declared for this service. Whenever there is no matching rule in the
// effective-TLD data (or in the default data, if the resource failed to
// load), the result will be dependent on which enum value was passed in.
// If EXCLUDE_UNKNOWN_REGISTRIES was passed in, the resulting registry length
// will be 0. If INCLUDE_UNKNOWN_REGISTRIES was passed in, the resulting
// registry length will be the length of the last subcomponent (e.g. 3 for
// foobar.baz).
enum UnknownRegistryFilter {
  EXCLUDE_UNKNOWN_REGISTRIES = 0,
  INCLUDE_UNKNOWN_REGISTRIES
};

// Returns the registered, organization-identifying host and all its registry
// information, but no subdomains, from the given GURL.  Returns an empty
// string if the GURL is invalid, has no host (e.g. a file: URL), has multiple
// trailing dots, is an IP address, has only one subcomponent (i.e. no dots
// other than leading/trailing ones), or is itself a recognized registry
// identifier.  If no matching rule is found in the effective-TLD data (or in
// the default data, if the resource failed to load), the last subcomponent of
// the host is assumed to be the registry.
//
// Examples:
//   http://www.google.com/file.html -> "google.com"  (com)
//   http://..google.com/file.html   -> "google.com"  (com)
//   http://google.com./file.html    -> "google.com." (com)
//   http://a.b.co.uk/file.html      -> "b.co.uk"     (co.uk)
//   file:///C:/bar.html             -> ""            (no host)
//   http://foo.com../file.html      -> ""            (multiple trailing dots)
//   http://192.168.0.1/file.html    -> ""            (IP address)
//   http://bar/file.html            -> ""            (no subcomponents)
//   http://co.uk/file.html          -> ""            (host is a registry)
//   http://foo.bar/file.html        -> "foo.bar"     (no rule; assume bar)
NET_EXPORT std::string GetDomainAndRegistry(const GURL& gurl,
                                            PrivateRegistryFilter filter);

// Like the GURL version, but takes an Origin. Returns an empty string if the
// Origin is opaque.
NET_EXPORT std::string GetDomainAndRegistry(const url::Origin& origin,
                                            PrivateRegistryFilter filter);

// Like the GURL / Origin versions, but takes a host (which is canonicalized
// internally). Prefer either the GURL or Origin variants instead of this one
// to avoid needing to re-canonicalize the host.
NET_EXPORT std::string GetDomainAndRegistry(std::string_view host,
                                            PrivateRegistryFilter filter);

// These convenience functions return true if the two GURLs or Origins both have
// hosts and one of the following is true:
// * The hosts are identical.
// * They each have a known domain and registry, and it is the same for both
//   URLs.  Note that this means the trailing dot, if any, must match too.
// Effectively, callers can use this function to check whether the input URLs
// represent hosts "on the same site".
NET_EXPORT bool SameDomainOrHost(const GURL& gurl1, const GURL& gurl2,
                                 PrivateRegistryFilter filter);
NET_EXPORT bool SameDomainOrHost(const url::Origin& origin1,
                                 const url::Origin& origin2,
                                 PrivateRegistryFilter filter);
// Note: this returns false if |origin2| is not set.
NET_EXPORT bool SameDomainOrHost(const url::Origin& origin1,
                                 const std::optional<url::Origin>& origin2,
                                 PrivateRegistryFilter filter);
NET_EXPORT bool SameDomainOrHost(const GURL& gurl,
                                 const url::Origin& origin,
                                 PrivateRegistryFilter filter);

// Finds the length in bytes of the registrar portion of the host in the
// given GURL.  Returns std::string::npos if the GURL is invalid or has no
// host (e.g. a file: URL).  Returns 0 if the GURL has multiple trailing dots,
// is an IP address, has no subcomponents, or is itself a recognized registry
// identifier.  The result is also dependent on the UnknownRegistryFilter.
// If no matching rule is found in the effective-TLD data (or in
// the default data, if the resource failed to load), returns 0 if
// |unknown_filter| is EXCLUDE_UNKNOWN_REGISTRIES, or the length of the last
// subcomponent if |unknown_filter| is INCLUDE_UNKNOWN_REGISTRIES.
//
// Examples:
//   http://www.google.com/file.html -> 3                 (com)
//   http://..google.com/file.html   -> 3                 (com)
//   http://google.com./file.html    -> 4                 (com)
//   http://a.b.co.uk/file.html      -> 5                 (co.uk)
//   file:///C:/bar.html             -> std::string::npos (no host)
//   http://foo.com../file.html      -> 0                 (multiple trailing
//                                                         dots)
//   http://192.168.0.1/file.html    -> 0                 (IP address)
//   http://bar/file.html            -> 0                 (no subcomponents)
//   http://co.uk/file.html          -> 0                 (host is a registry)
//   http://foo.bar/file.html        -> 0 or 3, depending (no rule; assume
//                                                         bar)
NET_EXPORT size_t GetRegistryLength(const GURL& gurl,
                                    UnknownRegistryFilter unknown_filter,
                                    PrivateRegistryFilter private_filter);

// Returns true if the given host name has a registry-controlled domain. The
// host name will be internally canonicalized. Also returns true for invalid
// host names like "*.google.com" as long as it has a valid registry-controlled
// portion (see PermissiveGetHostRegistryLength for particulars).
NET_EXPORT bool HostHasRegistryControlledDomain(
    std::string_view host,
    UnknownRegistryFilter unknown_filter,
    PrivateRegistryFilter private_filter);

// Like GetRegistryLength, but takes a previously-canonicalized host instead of
// a GURL. Prefer the GURL version or HostHasRegistryControlledDomain to
// eliminate the possibility of bugs with non-canonical hosts.
//
// If you have a non-canonical host name, use the "Permissive" version instead.
NET_EXPORT size_t
GetCanonicalHostRegistryLength(std::string_view canon_host,
                               UnknownRegistryFilter unknown_filter,
                               PrivateRegistryFilter private_filter);

// Like GetRegistryLength for a potentially non-canonicalized hostname.  This
// splits the input into substrings at '.' characters, then attempts to
// piecewise-canonicalize the substrings. After finding the registry length of
// the concatenated piecewise string, it then maps back to the corresponding
// length in the original input string.
//
// It will also handle hostnames that are otherwise invalid as long as they
// contain a valid registry controlled domain at the end. Invalid dot-separated
// portions of the domain will be left as-is when the string is looked up in
// the registry database (which will result in no match).
//
// This will handle all cases except for the pattern:
//   <invalid-host-chars> <non-literal-dot> <valid-registry-controlled-domain>
// For example:
//   "%00foo%2Ecom" (would canonicalize to "foo.com" if the "%00" was removed)
// A non-literal dot (like "%2E" or a fullwidth period) will normally get
// canonicalized to a dot if the host chars were valid. But since the %2E will
// be in the same substring as the %00, the substring will fail to
// canonicalize, the %2E will be left escaped, and the valid registry
// controlled domain at the end won't match.
//
// The string won't be trimmed, so things like trailing spaces will be
// considered part of the host and therefore won't match any TLD. It will
// return std::string::npos like GetRegistryLength() for empty input, but
// because invalid portions are skipped, it won't return npos in any other case.
NET_EXPORT size_t
PermissiveGetHostRegistryLength(std::string_view host,
                                UnknownRegistryFilter unknown_filter,
                                PrivateRegistryFilter private_filter);
NET_EXPORT size_t
PermissiveGetHostRegistryLength(std::u16string_view host,
                                UnknownRegistryFilter unknown_filter,
                                PrivateRegistryFilter private_filter);

typedef const struct DomainRule* (*FindDomainPtr)(const char*, unsigned int);

// Used for unit tests. Uses default domains.
NET_EXPORT_PRIVATE void ResetFindDomainGraphForTesting();

// Used for unit tests, so that a frozen list of domains is used.
NET_EXPORT_PRIVATE void SetFindDomainGraphForTesting(
    const unsigned char* domains,
    size_t length);

}  // namespace net::registry_controlled_domains

#endif  // NET_BASE_REGISTRY_CONTROLLED_DOMAINS_REGISTRY_CONTROLLED_DOMAIN_H_
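
/*
  Illustrative sketch only; nothing below is part of this header's API.

  The rule format described in the documentation comment at the top of this
  file (plain rules, "*." wildcard rules, and "!" exception rules) may be
  easier to follow with a small worked example.  The sketch below is a minimal
  stand-in under stated assumptions, not the lookup this service actually
  performs: the namespace registry_rule_sketch, the RuleSet alias, and the
  functions HasRule and SketchRegistryLength are all hypothetical names, the
  rules are held in a std::set, and the host is assumed to be canonical already,
  with no trailing dots and no IP literals.  The bool |include_unknown| loosely
  mirrors INCLUDE_UNKNOWN_REGISTRIES.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

namespace registry_rule_sketch {  // Hypothetical namespace, illustration only.

using RuleSet = std::set<std::string>;

inline bool HasRule(const RuleSet& rules, const std::string& rule) {
  return rules.count(rule) != 0;
}

// Returns the length of the registry portion of |host|, or 0 if the host has
// no dots, is itself a registry, or matches no rule while |include_unknown|
// is false.
inline size_t SketchRegistryLength(const RuleSet& rules,
                                   const std::string& host,
                                   bool include_unknown) {
  if (host.find('.') == std::string::npos)
    return 0;  // No subcomponents.

  // Try the longest suffix first, so the rule with the most dot-levels wins.
  size_t start = 0;
  while (true) {
    const std::string suffix = host.substr(start);
    const size_t dot = suffix.find('.');

    // An exception rule cancels the wildcard match: the registry is the
    // suffix with its first label removed (the implied non-wildcard rule).
    if (HasRule(rules, "!" + suffix))
      return dot == std::string::npos ? 0 : suffix.size() - dot - 1;

    // A plain rule, or the rule implied by "*.<suffix>".
    bool matches = HasRule(rules, suffix) || HasRule(rules, "*." + suffix);
    // A wildcard rule covering the first label of this suffix.
    if (!matches && dot != std::string::npos)
      matches = HasRule(rules, "*." + suffix.substr(dot + 1));

    if (matches)
      return start == 0 ? 0 : suffix.size();  // 0: host is itself a registry.

    if (dot == std::string::npos)
      break;  // No shorter suffix left to try.
    start += dot + 1;
  }

  // No rule matched: optionally treat the last label as the registry.
  return include_unknown ? host.size() - host.rfind('.') - 1 : 0;
}

}  // namespace registry_rule_sketch
```

  With the rules "com", "co.uk", "*.tokyo.jp" and "!pref.tokyo.jp", this sketch
  gives 3 for "www.google.com", 5 for "a.b.co.uk", 10 for "a.b.tokyo.jp"
  (registry "b.tokyo.jp") and 8 for "a.pref.tokyo.jp" (registry "tokyo.jp"),
  which agrees with the examples in the comments above.
*/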