Allowing for User Mischief - A Bad Word Filter

12 April 2003 15:39 by Chris

Note that this article was first published on 02/01/2003. The original article is available on DotNetJohn, where the code is also available for download.

Introduction

This article presents an improved core implementation of a solution to a problem I come across occasionally: detecting and/or removing suspect words from user-supplied text on web sites. A typical application scenario is a discussion forum. For example, I’ve worked on a few sports-related web sites where discussions can become… heated, and the language used occasionally strays into territory inappropriate for the site’s general audience. There are several approaches to dealing with this problem, some of which are discussed in my previous articles on this subject published on ASPAlliance ( http://www.aspalliance.com/sullyc/articles/user_mischief.aspx - no longer available) and 15Seconds ( http://www.15seconds.com/issue/030121.htm - no longer available). The latter article looks at a composite control based implementation. As indicated in those articles, I suggest the best approach is to identify and remove any suspect words.

A Starting Point

My previous implementation was based on a list of words and/or word fragments ('word roots') defined in an XML document. The items from this list were compared against the user-supplied text string and matches highlighted (replaced with '****') using the string manipulation functionality available in .NET. For re-usability I decided on a user control. See the first article ( http://www.aspalliance.com/sullyc/articles/user_mischief.aspx ) for full details, but the core of the implementation reads the XML into a local data structure, in this case an ArrayList, for subsequent direct comparison:

 <%@ Control Language="VB" %>
 <%@ Import Namespace="System.Xml" %>
 
 <script language="VB" runat="server">
 
 'word roots loaded from bad_words.xml
 Dim alWordList As new ArrayList
 
 Sub Page_Load()
   dim xmlDocPath as string = server.mappath("bad_words.xml")
   dim xmlReader as XmlTextreader = new xmlTextReader(xmlDocPath)
   'each text node holds the content of one <word> element
   while (xmlReader.Read())
     if xmlReader.NodeType=xmlNodeType.Text then
       alWordList.Add(xmlReader.Value)
       trace.write("Added: " & xmlReader.Value)
     end if
   end while
   xmlReader.Close()
 End Sub
 
 'returns InputString with every occurrence of a listed word root replaced by ****
 Public Function CheckString(InputString as String) as string
   dim element as string
   trace.write("Checking " & InputString)
   For Each element in alWordList
     trace.write("Checking: " & element)
     InputString=InputString.Replace(element,"****")
   Next
   trace.write("Returning " & InputString)
   Return InputString
 End Function
 
 </script>

with the XML file being of the format:

 <?xml version="1.0"?>
 <words>
   <word>word root 1</word>
   <word>word root 2</word>
 </words>

The actual words have been replaced to protect the innocent.

Then all that remains is to capture the user’s text, via a textbox perhaps, register the user control and place it on the page:

 <anti_swear:cleanup id="cleanup1" runat="server" />
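For the tag above to resolve, the page also needs a Register directive near the top. A sketch of what that might look like: the TagPrefix and TagName are taken from the tag above, while the Src path is an assumption based on the folder layout of the zipfile described at the end of this article.

```aspx
<%-- Src is assumed; adjust it to wherever the .ascx file actually lives --%>
<%@ Register TagPrefix="anti_swear" TagName="cleanup" Src="user_controls/anti_swear.ascx" %>
```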

and using the control to check the inputted text for ‘naughty’ word roots:

 dim clean_text as string
 
   clean_text=tbMessage.text 'text to be checked
   trace.write("message text: " & clean_text)
 
   clean_text=cleanup1.CheckString(clean_text)
   trace.write("message text (cleaned): " & clean_text)
 
   if clean_text<>tbMessage.text then
     trace.write("Text not clean!")
     tbMessage.text=clean_text
     lblInfo.Text="Naughty words found ... please remove!"
   else
     'all is OK … submit to db/ other permanent store for later recall
   end if

So, in this simple implementation CheckString returns a string with naughty word roots replaced by ‘****’ and we can detect if such words have been found as the returned text will be different from that passed into the function.

The actual detection is very simple:

 InputString=InputString.Replace(element,"****")

too simple in fact as we’ll shortly explore.

Note that in the XML document I’ve used the phrase ‘word root 1’. This is important: only the roots of suspect vocabulary need to be placed in the XML document, which reduces the effort involved for you. It limits the number of XML elements you need to cover the commonly used expletives, but it also means care must be taken not to exclude perfectly acceptable words.

The Problem

What’s the problem? You may well have already realised it. As pointed out to me by my fellow ASPAlliance columnist Jonathan Cogley (and as alluded to in the last paragraph of the previous section):

Your approach uses a regular text replace which could create a new problem. Since it will identify the offending sequence of letters in perfectly harmless words e.g. scunthorpe would be rendered as s****horpe.
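The effect Jonathan describes is easy to reproduce. The sketch below is a standalone console module, not part of the user control, and uses the mild stand-in root "hell" (an assumption for illustration, not a word from the real list):

```vb
Imports System

Module PlainReplaceDemo
    Sub Main()
        ' "hell" is a hypothetical stand-in root chosen for illustration.
        Dim root As String = "hell"
        Dim samples As String() = {"what the hell", "Michelle sells shells"}

        For Each sample As String In samples
            ' String.Replace matches the root anywhere, including inside harmless words.
            Console.WriteLine(sample.Replace(root, "****"))
        Next
        ' Prints:
        '   what the ****
        '   Mic****e sells s****s
    End Sub
End Module
```

Note too that String.Replace is case-sensitive, so "HELL" would pass through this version untouched.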

Jonathan suggested two possible solutions, with my thoughts on implementation also below:

a white word list – a list of permissible words to be checked if a word on the black word list was found. This approach I believe to be too prone to programmer 'error' – there are too many language combinations to provide a sleek solution.
regular expressions – the powerful language of regular expressions should be able to provide a better matching algorithm that would alleviate the problem.
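For comparison, here is a rough sketch of how the white word list idea might look. It is not the approach taken in this article, both lists are hypothetical, and unlike the original control it masks the whole word rather than just the root. It also assumes a version of .NET with generics, which postdates the original article.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module WhiteListSketch
    ' Hypothetical lists for illustration only.
    Private ReadOnly BlackRoots As String() = {"hell"}
    Private ReadOnly WhiteWords As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase) From {"hello", "shell", "Michelle"}

    ' Called once per matched word: keep it if white-listed, otherwise mask it.
    Function MaskUnlessWhiteListed(m As Match) As String
        If WhiteWords.Contains(m.Value) Then Return m.Value
        Return "****"
    End Function

    Function CheckString(inputText As String) As String
        For Each root As String In BlackRoots
            ' \w* on either side widens the match to the whole word containing the root.
            Dim pattern As String = "\w*" & Regex.Escape(root) & "\w*"
            inputText = Regex.Replace(inputText, pattern, AddressOf MaskUnlessWhiteListed, RegexOptions.IgnoreCase)
        Next
        Return inputText
    End Function

    Sub Main()
        Console.WriteLine(CheckString("Hello Michelle, what the hell?"))
        ' Prints: Hello Michelle, what the ****?
    End Sub
End Module
```

The weakness is visible straight away: every harmless word containing a root has to be anticipated and listed by hand.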

The Solution

Let’s see if we can find a better solution. An obvious starting point is the example above, and with it the regular expression concept of word boundaries.

Scunthorpe is a town in the UK, for our international readers (and possibly the name of towns elsewhere in the world, for all I know).

As we’re interested in roots of words, we’d prefer that Scunthorpe not match: it starts with an S, so the offending sequence does not begin the word and shouldn’t be offensive to anyone. However, we are interested in matching any derivatives of our dubious root words, so whilst we want to specify the beginning word boundary we shouldn’t be interested in the ending word boundary.

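In regular expressions, positions such as word boundaries are identified via anchors. An anchor specifies the position where the pattern must occur rather than matching any characters itself. For example:

^ Matches at the start of the input (or of each line in multiline mode).
$ Matches at the end of the input (or of each line in multiline mode).
\< Matches at the beginning of a word (some regular expression dialects only; not supported by .NET).
\> Matches at the end of a word (some regular expression dialects only; not supported by .NET).
\b Matches at the beginning or the end of a word.
\B Matches at any position that is not the beginning or end of a word.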

Thus the above includes a few options we’re interested in. Let’s use \b, the word boundary anchor. It matches the position between a word character (a letter, digit or underscore) and anything that can come before or after a word, e.g. white space, punctuation or the beginning or end of a line.

So we want to engage in a regular expression search / replace for '\broot word'. This should solve our problem. How do we do this in .NET?

Regular Expression Solution in .NET

We’re going to focus on solving this little problem and shall not consider the extensive range of support for regular expressions in .NET. However, look out for such an article on dotnetjohn in the near future.

There are a variety of supporting classes we could use:

Regex: the Regex class represents a regular expression. It also contains static methods that allow use of other regular expression classes without explicitly instantiating objects of the other classes.

Match: the Match class represents the results of a regular expression matching operation.

MatchCollection: the MatchCollection class represents a sequence of successful non-overlapping matches.

An example of how we might utilize the Regex class is:

 Dim r As Regex = New Regex("\b" & "NaughtyRoot")

Further, among the members of the Regex class are:

IsMatch - indicates whether the regular expression finds a match in the input string.

Match - searches an input string for an occurrence of a regular expression and returns the precise result as a single Match object.

Matches - searches an input string for all occurrences of a regular expression and returns all the successful matches as if Match were called numerous times.

Replace - replaces all occurrences of a character pattern defined by a regular expression with a specified replacement character string.
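To make these four members concrete, here is a small standalone sketch (a console module rather than the user control) that runs each of them against the same text. The root "hell" and the sample sentence are stand-ins chosen for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module RegexMembersDemo
    Sub Main()
        ' \b anchors the hypothetical root "hell" to the start of a word.
        Dim r As New Regex("\bhell")
        Dim sample As String = "hell, Michelle said hello"

        Console.WriteLine(r.IsMatch(sample))          ' True
        Console.WriteLine(r.Match(sample).Index)      ' 0 (position of the first match)
        Console.WriteLine(r.Matches(sample).Count)    ' 2 ("hell" and the start of "hello")
        Console.WriteLine(r.Replace(sample, "****"))  ' ****, Michelle said ****o
    End Sub
End Module
```

"Michelle" is left alone because the root sits in the middle of the word, while "hello" is still caught because it begins with the root.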

In line with our previous implementation we would use the Replace function, replacing our CheckString function with:

 'requires <%@ Import Namespace="System.Text.RegularExpressions" %> in the user control
 Public Function CheckString(InputString as String) as string
   Dim r As Regex
   dim element as string
   trace.write("Checking " & InputString)
   For Each element in alWordList
     '\b anchors the word root to the start of a word
     r = New Regex("\b" & element)
     trace.write("Checking: " & element)
     InputString=r.Replace(InputString,"****")
   Next
   trace.write("Returning " & InputString)
   Return InputString
 End Function

Which does indeed do what we wish. One caveat is that as we are only checking for the beginning of words some swear words may slip through the net if we don’t explicitly add them to the bad words list. ‘Motherf**ker’ is an example. I can’t think of an easy way around this problem however. You could extend the solution to include end of word boundaries but then you need to include ‘f**ker’ as well as ‘f**k’, for example. Plus, you increase the risk of trapping valid words.

Note also that the provided solution is not perfect on the grounds that some valid words will no doubt still be challenged by this solution. I do believe it is a good compromise, however. It might be a good option to change the language of the interface to indicate the presence of 'possibly suspect words' and to let the user edit the text. It should be obvious to the user why their text has been returned to them.
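Two small refinements may be worth considering, shown here as a sketch rather than as part of the downloadable code. Regex.Escape stops any regular expression metacharacters in a word root (a "." or "+", say) from being interpreted as part of the pattern, and RegexOptions.IgnoreCase catches upper-case variants. The module name, the explicit word list parameter and the sample root are assumptions made for illustration.

```vb
Imports System
Imports System.Collections
Imports System.Text.RegularExpressions

Module EscapedFilterSketch
    ' Variant of CheckString that takes the word list as a parameter so it can run standalone.
    Function CheckString(inputText As String, wordList As ArrayList) As String
        For Each element As String In wordList
            ' Escape the root so it is matched literally, then anchor it to a word start.
            Dim pattern As String = "\b" & Regex.Escape(element)
            inputText = Regex.Replace(inputText, pattern, "****", RegexOptions.IgnoreCase)
        Next
        Return inputText
    End Function

    Sub Main()
        Dim roots As New ArrayList()
        roots.Add("hell")   ' hypothetical stand-in root
        Console.WriteLine(CheckString("HELL no, said Michelle", roots))
        ' Prints: **** no, said Michelle
    End Sub
End Module
```

Neither change addresses the compound word caveat above; they only make the existing matching more robust.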

Conclusion

I hope this article has provided a useful extension to my earlier articles on the subject and in doing so introduced some readers to the powerful language provided by regular expressions. If you’d like to raise any points about this article, in particular thoughts on how the solution could be improved, email me (sullyc-olops@btinternet.com ).

The Zipfile

The zipfile includes the following:

markII.aspx: web form page with a text box, calling the user control methods.

user_controls/anti_swear.ascx: string based version
user_controls/anti_swear2.ascx: regular expression based version
user_controls/bad_words.xml

To use, populate bad_words.xml and alter the user control reference in markII.aspx to see the differences between the versions.

You may download the code here.

About the author

I am Dr Christopher Sully (MCPD, MCSD), a Cardiff, UK based IT Consultant/Developer. I have been involved in the industry since 1996, though I started programming considerably earlier than that. Since then I've worked mainly on web application projects utilising Microsoft products and technologies, principally ASP.NET and SQL Server, across all phases of the project lifecycle. If you might like to utilise some of this experience I would strongly recommend that you contact me. I am also trying to improve my Welsh, so I am likely to blog about this as well as IT matters.
