I’ve built a fairly simple UDF for stripping out HTML from a chunk of text. I’ll quickly explain my logic and what this UDF can do.
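First, there are two types of tags in HTML. One is a wrapping tag, which has an opening and a closing tag around its content. An example of this is the bold tag.
<b>my bold text</b>
The other is a self-contained tag, such as the image tag.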
<img src="my/path/image.jpg" />
If the tag is a wrapping tag, we need to decide whether or not to delete the content it wraps. For example, we will most likely want to keep the content inside a bold tag, while we will want to delete the content inside a script tag.
When it comes right down to it, we need to make three decisions about the HTML being deleted.
- Which tag needs to be deleted?
- Is it a wrapping tag?
- If it is a wrapping tag, should its content be deleted as well?
So let’s take a look at the UDF.
<cfcomponent displayname="Strip HTML" hint="Strips any specified HTML tags out of a given chunk of text." output="false">
    <!--- ************************ STRIP HTML ************************ --->
    <cffunction name="stripHtml" displayname="Strip HTML" description="Strips out specified HTML tags." access="public" output="false" returntype="struct">
        <!--- ARGUMENTS --->
        <cfargument name="text" displayName="Text" type="string" hint="Text to strip out html tags from." required="true" />
        <cfargument name="tags" displayName="Tags" type="string" hint="Tags to be stripped from the text. Ex. '[string:tag name],[what to remove - {string:tag | string:content}],[is it a wrapping tag? {boolean}]'. Tags are delimited with semi-colons." required="true" />
        <!--- SET SOME LOCAL VARS (LOOP INDEX IS VAR-SCOPED SO IT STAYS LOCAL TO THIS CALL) --->
        <cfset var i = "">
        <cfset var tagtoberemoved = "">
        <cfset var whatgetsremoved = "">
        <cfset var wrappingtag = "">
        <!--- BUILD STRUCT --->
        <cfset var data = structNew()>
        <cfset data.success = true>
        <cfset data.message = "">
        <cfset data.originaltext = ARGUMENTS.text>
        <cfset data.strippedtext = ARGUMENTS.text>
        <!--- CHECK IF ALL CONTENT SHOULD BE REMOVED --->
        <cfif ARGUMENTS.tags eq "all">
            <!--- REMOVE HTML TAGS --->
            <cfset data.strippedtext = rereplaceNoCase(ARGUMENTS.text, "<[^>]*>", "", "all")>
        <cfelse>
            <!--- LOOP OVER THE LIST OF TAGS TO BE REMOVED --->
            <cfloop list="#ARGUMENTS.tags#" index="i" delimiters=";">
                <!--- SET ATTRIBUTES OF TAG TO BE DELETED --->
                <cfset tagtoberemoved = listFirst(i, ",")>
                <cfset whatgetsremoved = listGetAt(i, 2, ",")>
                <cfset wrappingtag = listLast(i, ",")>
                <!--- IF REMOVING JUST THE TAG --->
                <cfif whatgetsremoved eq "tag">
                    <!--- CHECK IF IT IS A WRAPPING TAG --->
                    <cfif wrappingtag eq true>
                        <!--- REMOVE WRAPPING TAG, BUT NOT THE CONTENT. (\s[^>]*)? ALSO MATCHES TAGS THAT CARRY ATTRIBUTES --->
                        <cfset data.strippedtext = rereplaceNoCase(data.strippedtext, "<#tagtoberemoved#(\s[^>]*)?>", "", "all")>
                        <cfset data.strippedtext = rereplaceNoCase(data.strippedtext, "</#tagtoberemoved#\s*>", "", "all")>
                    <cfelse>
                        <!--- REMOVE CONTAINED TAG, WITH OR WITHOUT ATTRIBUTES AND TRAILING SLASH --->
                        <cfset data.strippedtext = rereplaceNoCase(data.strippedtext, "<#tagtoberemoved#(\s[^>]*)?/?>", "", "all")>
                    </cfif>
                <!--- IF REMOVING TAG AND CONTENT --->
                <cfelseif whatgetsremoved eq "content">
                    <!--- CHECK IF IT IS A WRAPPING TAG --->
                    <cfif wrappingtag eq true>
                        <!--- REMOVE THE TAG AND CONTENT. .*? IS LAZY SO EACH OPEN/CLOSE PAIR IS REMOVED SEPARATELY --->
                        <cfset data.strippedtext = rereplaceNoCase(data.strippedtext, "<#tagtoberemoved#(\s[^>]*)?>.*?</#tagtoberemoved#\s*>", "", "all")>
                    <cfelse>
                        <!--- REMOVE CONTAINED TAG, WITH OR WITHOUT ATTRIBUTES AND TRAILING SLASH --->
                        <cfset data.strippedtext = rereplaceNoCase(data.strippedtext, "<#tagtoberemoved#(\s[^>]*)?/?>", "", "all")>
                    </cfif>
                </cfif>
            </cfloop>
        </cfif>
        <!--- RETURN STRUCT --->
        <cfreturn data>
    </cffunction>
</cfcomponent>
The method takes two arguments. The first is the chunk of text to strip HTML from. The second is a string describing the tags to strip. Multiple tags can be specified by separating them with semi-colons. For each tag to be deleted, three pieces of information need to be provided, delimited with commas. These correspond to the three decisions listed above. First is the name of the tag: to remove an img tag we would specify “img”, and for bold “b”. Second, we specify whether we want just the tag removed or the tag and the content it wraps. There are two acceptable strings here: “tag” (for removing just the tag) and “content” (for removing the tag and the content). Lastly, we specify whether this is a wrapping or a self-contained tag. For an image tag we would specify “false” because it is not a wrapping tag. For bold we would specify “true” because it is a wrapping tag.
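A sample spec string may make the format easier to read. This is only an illustrative sketch: the variable names are my own, and the loop simply echoes how each entry breaks down, using the same list functions the UDF relies on.

```cfml
<!--- DROP SCRIPT BLOCKS ENTIRELY, UNWRAP BOLD TEXT, REMOVE IMAGES --->
<cfset VARIABLES.tags = "script,content,true;b,tag,true;img,tag,false">

<!--- ECHO THE THREE PARTS OF EACH SEMI-COLON DELIMITED ENTRY --->
<cfoutput>
<cfloop list="#VARIABLES.tags#" index="VARIABLES.spec" delimiters=";">
tag=#listFirst(VARIABLES.spec, ",")#, remove=#listGetAt(VARIABLES.spec, 2, ",")#, wrapping=#listLast(VARIABLES.spec, ",")#<br />
</cfloop>
</cfoutput>
```

Each entry must contain all three parts, since the UDF reads the second one by position.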
One last note: if you want to strip out all HTML tags, you can just pass the string “all” as the second argument. Note that this strips only the tags themselves; any content wrapped by the tags is left in place.
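To make that caveat concrete, here is a small sketch that applies the same regular expression the “all” branch uses. The sample string and variable names are made up for illustration.

```cfml
<!--- SAMPLE INPUT (HYPOTHETICAL): A SCRIPT BLOCK FOLLOWED BY VISIBLE TEXT --->
<cfset VARIABLES.sample = "<script>alert(1)</script><p>Hello</p>">

<!--- SAME PATTERN AS THE "all" BRANCH: REMOVES THE TAGS ONLY --->
<cfset VARIABLES.tagsOnly = reReplaceNoCase(VARIABLES.sample, "<[^>]*>", "", "all")>

<!--- THE SCRIPT BODY SURVIVES: tagsOnly IS NOW "alert(1)Hello" --->
<cfoutput>#VARIABLES.tagsOnly#</cfoutput>
```

If script content needs to go as well, one option is to call the method first with “script,content,true” and then pass the resulting strippedtext through a second call with “all”.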
So let’s look at an example of how to call this method.
<!--- TEXT TO STRIP HTML FROM --->
<cfset VARIABLES.text = "Lorem <img />ipsum <b>dolor</b> sit amet, <em>consectetur</em> adipiscing elit.">
<!--- STRIP OUT ALL HTML --->
<cfset stripHtml(VARIABLES.text, "all")>
<!--- STRIP OUT IMG, B, AND EM TAGS --->
<cfset stripHtml(VARIABLES.text, "img,tag,false;b,tag,true;em,content,true")>
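The calls above discard the struct that the method returns. Here is a minimal sketch of capturing it, assuming the component is saved as StripHtml.cfc next to the calling template. The file name and the variables stripper and result are my own, not part of the original.

```cfml
<!--- ASSUMPTION: THE COMPONENT ABOVE IS SAVED AS StripHtml.cfc IN THE SAME DIRECTORY --->
<cfset VARIABLES.stripper = createObject("component", "StripHtml")>
<cfset VARIABLES.text = "Lorem <img />ipsum <b>dolor</b> sit amet, <em>consectetur</em> adipiscing elit.">

<!--- KEEP THE RETURNED STRUCT AND READ THE STRIPPED TEXT FROM IT --->
<cfset VARIABLES.result = VARIABLES.stripper.stripHtml(VARIABLES.text, "img,tag,false;b,tag,true;em,content,true")>

<!--- ESCAPE THE VALUE ON OUTPUT IN CASE ANY MARKUP WAS LEFT BEHIND --->
<cfoutput>#htmlEditFormat(VARIABLES.result.strippedtext)#</cfoutput>
```

With this input the img tag and the bold tags disappear while “dolor” stays, and the em tag is removed together with “consectetur”, which leaves a double space after the comma.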
Feel free to copy this and use it as you please.