We have created an algorithm based on the specifications below and tested it on several hundred pages. The simple algorithm is:
The algorithm works quite well at detecting ASCII art. It misses very few ASCII art images and gives a small number of false positives. Two kinds of content that it wrongly detects as ASCII art are computer programming source code examples and guitar tablature.
Studying various samples of ASCII art on the Internet, I noticed that all of the ASCII art is enclosed within <XMP> or <PRE> elements. The text within these two tags is preformatted and displayed in a monospace font. This enables the author to specify exactly where text, input boxes, pick lists, etc., should appear on the page, so what you see on screen is what you get.
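Since only the contents of <PRE> and <XMP> elements matter, a first step is to pull those blocks out of the page. The following is a minimal sketch, not the algorithm tested above: the function name `extract_preformatted` is hypothetical, and a regular expression is used as a shortcut where a real HTML parser would be more robust.

```python
import re

# Matches <pre>...</pre> and <xmp>...</xmp>, case-insensitively, across lines.
# Assumption: blocks are not nested and closing tags are present.
_BLOCK = re.compile(r"<(pre|xmp)\b[^>]*>(.*?)</\1\s*>", re.IGNORECASE | re.DOTALL)


def extract_preformatted(html):
    """Return the raw text of every <PRE> or <XMP> block in the page."""
    return [match.group(2) for match in _BLOCK.finditer(html)]
```

Everything outside these blocks is ignored, which matches the observation that the ASCII art samples always sit inside one of the two elements.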
I've also come across many ASCII art samples saved as *.txt files, which any browser displays in a monospace font, and some that have been converted into GIF format. However, these cases are irrelevant to this particular study.
As for the content within these two tags, ASCII art is generally characterized by one or more runs of 4 or more consecutive occurrences of the same character. The characters can be any ASCII characters (letters, numbers, symbols, etc.). An example of this ASCII art is here.
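The repeated-character pattern can be expressed as a regular expression with a back-reference. This is only a sketch: the name `has_repeated_run` is hypothetical, and restricting the match to printable, non-space ASCII (`!` through `~`) is an assumption, made so that runs of blanks can be treated as a separate case.

```python
import re

# One printable non-space ASCII character followed by at least three
# more copies of itself, i.e. a run of 4 or more identical characters.
_CHAR_RUN = re.compile(r"([!-~])\1{3,}")


def has_repeated_run(text):
    """True if the text contains 4+ consecutive copies of one character."""
    return _CHAR_RUN.search(text) is not None
```

For example, a box border such as `+-----+` matches, while ordinary words do not.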
There were many cases of ASCII art where the above pattern does not apply: a run of 4 or more of the same consecutive character is not seen at all. In these particular cases, however, a run of 4 or more blank spaces in succession can be used to detect ASCII art. Sample of ASCII art with 4 or more blank spaces.
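The blank-space fallback can be sketched the same way. The name `has_space_run` is hypothetical, and counting only the space character (not tabs) is an assumption.

```python
import re

# Four or more space characters in succession.
_SPACE_RUN = re.compile(r" {4,}")


def has_space_run(text):
    """True if the text contains a run of 4 or more blank spaces."""
    return _SPACE_RUN.search(text) is not None
```

Note that indented source code also contains such runs, which is consistent with source code examples showing up as false positives.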
I noticed that some ASCII characters, such as '|', the underline character, and others that resemble some sort of line, are used more extensively than others. However, some ASCII art was done entirely with alphabetical characters and some entirely with numeric characters. Therefore, there seems to be no definite pattern or preference in the type of ASCII characters used.
In general, ASCII art is displayed using more than 10 lines. This makes it easier for the artist to represent lines and shapes distinctly. One pattern I noticed is that when the ASCII art is small (displayed in fewer than 10 lines), mostly non-alphabetical characters are used. Here is an example of small ASCII art. This particular rule would imply the use of a dictionary to store all the non-alphabetic characters for comparisons. This may pose a problem.
On the other hand, when the ASCII art is large, it makes use of all types of characters. Here is an example of large ASCII art.
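One way to sidestep the dictionary is to test each character for being a letter and treat everything else as non-alphabetic. The sketch below does this; all names are hypothetical, and the 0.5 threshold for "mostly" is an assumption rather than a measured value.

```python
def non_alpha_ratio(text):
    """Fraction of visible (non-whitespace) characters that are not letters."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    non_alpha = sum(1 for c in visible if not c.isalpha())
    return non_alpha / len(visible)


def looks_like_small_art(text, max_lines=10, threshold=0.5):
    """True for a block of fewer than max_lines non-empty lines
    whose visible characters are mostly non-alphabetic."""
    lines = [line for line in text.splitlines() if line.strip()]
    return len(lines) < max_lines and non_alpha_ratio(text) > threshold
```

A three-line drawing made of slashes and brackets passes this test, while a short paragraph of ordinary prose does not.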
In general, I found ASCII art to be very ambiguous and without any definite pattern. There seem to be some general rules that define what ASCII art is, but devising rules that can be used to detect ASCII art in an HTML document is a difficult task. ASCII art cannot be defined by one particular pattern or rule, only by many rules. Therefore, detecting ASCII art in an HTML document requires using several conditions (rules) in conjunction.
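To illustrate how several conditions might be combined, the sketch below flags a <PRE> or <XMP> block when it contains either a run of 4 or more identical characters or a run of 4 or more spaces. This particular combination, the function name `detect_ascii_art`, and the regex-based tag matching are all assumptions made for illustration; it is not the exact rule set that was tested.

```python
import re

# Assumption: blocks are not nested and closing tags are present.
_BLOCK = re.compile(r"<(pre|xmp)\b[^>]*>(.*?)</\1\s*>", re.IGNORECASE | re.DOTALL)
_CHAR_RUN = re.compile(r"([!-~])\1{3,}")   # 4+ identical printable characters
_SPACE_RUN = re.compile(r" {4,}")          # 4+ blank spaces


def detect_ascii_art(html):
    """Return the preformatted blocks that trigger at least one run rule."""
    hits = []
    for match in _BLOCK.finditer(html):
        body = match.group(2)
        if _CHAR_RUN.search(body) or _SPACE_RUN.search(body):
            hits.append(body)
    return hits
```

Rules this loose would also flag indented source code and guitar tablature placed inside preformatted blocks, so further conditions would be needed to separate those cases.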
The following is the summary of ASCII art detection rules:
|Example||Rule 1||Rule 2A||Rule 2B||Rule 3A||Rule 3B|
The pattern is:
Last Modified: March 10, 2000