Early generation filtering technology covered both bases with varying degrees of adequacy. Vendors chose Web crawlers and site mining as the fastest and easiest means of developing large URL databases, but a high percentage of the sites chosen for rating were so obscure as to be irrelevant to users. This made direct comparisons of “number of URLs rated” difficult.
New content threats provide new opportunities and new challenges for Web filtering. As firewalls and desktop antivirus became ubiquitous, hackers and unethical entrepreneurs found the only remaining open door to be the Web browser. Web content threats are the fastest growing computer danger because most organisations leave ports 80 and 443 open through their firewalls. The browser has become the soft underbelly of network security.
Database “coverage” and classification “accuracy” are the most important factors in effectively enforcing appropriate use policy and securing Web content. If either is missing, policies simply won’t work and users will be vulnerable. To be viable, a Web filtering architecture must be as “future proof” as possible to ensure coverage is optimised for both known and new Web pages. The architecture must also provide a means of accurately reflecting the complexity inherent in contemporary Web pages. Overly simplified rating structures are quickly overwhelmed by millions of unique, and often multidisciplinary, Web sites. These are just a few of the many elements that must be addressed to deliver a high database coverage rate with highly accurate classification.
Coverage is the ability of a filtering product to identify all websites which should be placed in a given category. Coverage answers the question, “Of 100 websites that were actually category ‘X’ (Pornography, Spyware, Gambling, etc.), how many did the filter actually categorise as ‘X’?” The higher the percentage, the greater the filter’s coverage.
To have the best coverage, a web filtering product must be able to:
- Rate domains (rather than URL or IP address) where appropriate
An individual domain may have thousands of unique URLs underneath it. New URLs may be added under these domains daily, or in some cases, by the minute. For homogeneous domains, there are coverage and performance advantages to rating the domain instead of the URL or IP. By rating the domain, all new URLs added under that domain are instantly covered. This also requires less space in the database, which improves overall performance.
- Categorise websites by IP address, as well as by URL as appropriate
Websites are accessed not only via URL, but also via IP address. Although this sounds simplistic, not all filtering products are able to categorise both. Some early generation filtering products attempt to infer ratings for requested IP addresses from known URLs by using reverse-DNS lookups, but this is slow and unreliable.
- Rate sites harvested primarily from user requests
Another measure of coverage quality is the relevance of a filtering database. No vendor can rate all 16 billion+ pages on the Web, and it’s not necessary to do so. A large percentage of those pages are defunct or so obscure that including a rating adds no value. Rating them does nothing for policy enforcement yet adds a performance cost, so they are best left out of the database. The sites users actually request are, by definition, the relevant ones.
- Transparently pull updates on demand
Being able to pull new ratings on demand as needed provides better real-time coverage than frequently pushing batches of recent URL ratings to the local copy of the filtering database. Automated pulling checks for up-to-the-second ratings of the specific Web page being accessed. In contrast, pushing updates at intervals is more likely to result in missing a relevant web site. Frequent pushes also use more bandwidth for thousands of sub-optimal refreshes per month, most of which cover pages that users in your organisation will not see on a given day.
- Categorise new or unrated Web sites on the fly
Tens of millions of new pages are created each month, and approximately 30,000 new pornographic pages a day. Web crawlers and data mining are prone to finding irrelevant pages, and such a “boil the ocean” approach finds new pages too slowly. High coverage requires the ability to rate new pages in real time, at the moment a user accesses the page. This is a complement to the strategy of rating only sites users actually visit.
- Include relevant categories from a policy enforcement standpoint
Early generation filtering products often inflated their reported coverage rates by creating meaningless catch-all or miscellaneous categories. This also inflated their number of categories, but added no value for policy enforcement.
- Recognise and categorise websites in a wide range of languages
The Internet is a global tool, used by enterprises and organisations with offices worldwide. Therefore, the ability to categorise web pages and sites across a broad set of languages is critical for web filtering solutions.
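Several of the coverage points above (domain-level ratings, IP ratings, on-demand pulls and on-the-fly categorisation) amount to a lookup order. The following is a minimal sketch of that order, not any vendor's implementation. The rating store, the host names, and the `pull_rating` and `classify_live` helpers are all hypothetical stand-ins invented for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical local rating store; keys may be domain names or IP addresses.
LOCAL_RATINGS = {
    "casino.example": {"Gambling"},
    "203.0.113.7": {"Gambling"},  # the same site reached by IP address
}


def candidate_hosts(host):
    """Yield the host itself, then each parent domain (a.b.c -> b.c)."""
    yield host
    # A dotted IPv4 address has no parent domain to fall back to.
    if host.replace(".", "").isdigit():
        return
    labels = host.split(".")
    for i in range(1, len(labels) - 1):
        yield ".".join(labels[i:])


def pull_rating(host):
    """Stand-in for an on-demand query to a vendor rating service (assumed)."""
    remote = {"new-bets.example": {"Gambling"}}
    return remote.get(host)


def classify_live(url):
    """Stand-in for real-time categorisation of a page nobody has rated yet."""
    return {"Gambling"} if "poker" in url.lower() else {"Unrated"}


def rate(url):
    host = urlsplit(url).hostname or ""
    # 1. Local database: exact host (name or IP), then parent domains,
    #    so a new URL under a rated domain is covered immediately.
    for name in candidate_hosts(host):
        if name in LOCAL_RATINGS:
            return LOCAL_RATINGS[name]
    # 2. Pull the rating on demand and keep it for later requests.
    rating = pull_rating(host)
    if rating is not None:
        LOCAL_RATINGS[host] = rating
        return rating
    # 3. Nothing known anywhere: categorise the page on the fly.
    return classify_live(url)


if __name__ == "__main__":
    for u in ("http://www.casino.example/new/page.html",
              "http://203.0.113.7/",
              "https://new-bets.example/odds",
              "http://unknown.example/poker-night"):
        print(u, sorted(rate(u)))
```

Because the local store is filled only by requests that users actually make, it stays small and relevant, which is the argument made above for pulling rather than pushing.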
Accuracy is the ability of a filtering product to precisely and consistently categorise sites. Accuracy answers the question, “Of the 100 websites the filter categorised as ‘X’ (Pornography, Spyware, Gambling, etc.), how many actually were ‘X’?” The higher the percentage, the greater the filter’s accuracy.
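The two questions, for coverage and for accuracy, differ only in which set of sites forms the denominator. The sketch below computes both for one category from a small hand-labelled sample. The site names and labels are invented for illustration, and the function names are hypothetical. In information-retrieval terms these two measures correspond to recall and precision.

```python
def coverage(actual, predicted, category):
    """Of the sites that really are `category`, the share the filter caught."""
    relevant = [site for site, cat in actual.items() if cat == category]
    if not relevant:
        return None  # undefined when the sample has no such sites
    caught = sum(1 for site in relevant if predicted.get(site) == category)
    return caught / len(relevant)


def accuracy(actual, predicted, category):
    """Of the sites the filter labelled `category`, the share that really are."""
    flagged = [site for site, cat in predicted.items() if cat == category]
    if not flagged:
        return None  # undefined when the filter flagged nothing
    correct = sum(1 for site in flagged if actual.get(site) == category)
    return correct / len(flagged)


# Invented sample: what each site really is, and what the filter said.
ACTUAL = {"s1": "Gambling", "s2": "Gambling", "s3": "Gambling",
          "s4": "Gambling", "s5": "News"}
PREDICTED = {"s1": "Gambling", "s2": "Gambling", "s3": "News",
             "s5": "Gambling"}  # s4 is missing from the database entirely

if __name__ == "__main__":
    print("coverage:", coverage(ACTUAL, PREDICTED, "Gambling"))
    print("accuracy:", accuracy(ACTUAL, PREDICTED, "Gambling"))
```

In this sample the filter catches two of the four real gambling sites, and two of the three sites it flags are genuinely gambling, so a product can score differently on each measure.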
To achieve the highest accuracy, a web filtering product must be able to:
- Accurately categorise the sites users are ultimately attempting to access
Users can bypass early generation URL filtering through several widely known techniques. All of these techniques use an intermediary Web page which pulls content that a user selects from an entirely different kind or category of Web page. Early generation filtering only “sees” the intermediary page, rather than the true destination content. As a result, it often has only a superficial rating for the intermediary, which is of no use for enforcing policy on what the user actually views.
- Place websites in multiple categories, as necessary
Web pages do not always fit easily into a single category. A sports betting site, for example, belongs under both Sports and Gambling. An accurate web filter would recognise this and classify the site into both of these categories, as many enterprises will allow access to sports sites, but block access to gambling sites altogether.
- Categorise subdirectories, as well as top-level domains
An accurate web filtering product should recognise sites that host home pages for users, and categorise the actual content on each specific URL.
- Process rating requests “on proxy”
To minimise impact on user productivity, and scale to the needs of large enterprises, a content filtering solution must be efficiently architected to deliver very high performance. Some commodity operating systems are inherently slower at processing rating requests. Common configurations, such as hosting the filtering intelligence in pass-by mode off-box, are inherently slow.
- Include IP ratings locally
Some early generation filtering systems attempt to provide ratings for the IP version of URLs in the database by performing a reverse-DNS lookup whenever just the IP is requested. However, this adds considerable latency to processing the rating request. Frequently, requests are handled so slowly an error message is returned instead of a rating. Such short-cuts only benefit the filtering vendor, not the user.
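Two of the accuracy points in the list above, multiple categories per site and ratings for subdirectories, can be illustrated together. This is a minimal sketch under stated assumptions: the hosting domain, the user directories, the “Web Hosting” category and the blocked-category policy are all hypothetical, and a real product would use a far larger and faster data structure.

```python
from urllib.parse import urlsplit

# Hypothetical ratings keyed by (host, path prefix).
# A single entry may carry more than one category.
RATINGS = {
    ("hosting.example", "/"): {"Web Hosting"},
    ("hosting.example", "/~alice/"): {"Sports"},
    ("hosting.example", "/~bob/"): {"Sports", "Gambling"},
}

# Example policy: categories the organisation blocks outright.
BLOCKED = {"Gambling", "Pornography", "Spyware"}


def categories_for(url):
    """Return the categories of the most specific rated prefix for this URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path or "/"
    best_prefix, best = "", {"Unrated"}
    for (rated_host, prefix), cats in RATINGS.items():
        # The longest matching prefix wins, so a user's directory
        # overrides the rating of the hosting domain as a whole.
        if (rated_host == host and path.startswith(prefix)
                and len(prefix) > len(best_prefix)):
            best_prefix, best = prefix, cats
    return best


def is_allowed(url):
    """Block the request if any of the page's categories is blocked."""
    return not (categories_for(url) & BLOCKED)


if __name__ == "__main__":
    for u in ("http://hosting.example/~alice/scores.html",
              "http://hosting.example/~bob/odds.html"):
        print(u, sorted(categories_for(u)), is_allowed(u))
```

With a single category per site, the second page would have to be rated either Sports (and wrongly allowed) or Gambling (and lost to any policy that reports on sports browsing). Holding both lets the policy decide.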
The nature of Web traffic and browsing habits has evolved far beyond early generation URL filtering architectures. Enforcing appropriate use policy and providing robust Web content security requires a truly dynamic filtering solution. Such a solution also gives IT the visibility and control necessary to keep up with future challenges and opportunities.
Wayne Neich is the Country Manager of Blue Coat Systems, Australia and New Zealand.