What was I thinking?

Tags - Categories : Dave | Java | Life | Review | Ubuntu

I have seen several blogs and articles suggesting that a simple CAPTCHA can be implemented by adding a form field hidden with CSS and then rejecting any form submission that fills in this 'poison' field. I'm not going to single out any particular article, but they are easy to find.

This always felt like a bad idea to me and I have never been tempted to try it, but I had never been able to convince users that it should be avoided. It appears that spammers have caught on, and I now have proof.

We use jCaptcha on our site and it does the job for us, although I am prepared to admit that our safety has much to do with the fact that our traffic is relatively low. We err on the side of legibility so that we don't annoy our users, but this in turn makes the CAPTCHAs easier to break.

Before plugging the test into our forms, we were being attacked on many of our client sites, and were often receiving several thousand faked submissions a day. To test the behaviour of the CAPTCHA, our implementation includes the time and IP of all submissions as well as indicating whether the test field was:

  • entered correctly
  • entered incorrectly
  • not entered (i.e. left blank)
  • not sent - only possible via automation
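
The four outcomes above can be sketched as a small classifier. This is only an illustration in Python, not our actual jCaptcha integration: the field name `captcha`, the enum, the case-insensitive comparison and the "Correct Value" and "Blank" labels are all assumptions (only "Incorrect Value" and "NULL" appear in the logs shown in this post).

```python
from enum import Enum
from typing import Mapping


class CaptchaOutcome(Enum):
    CORRECT = "Correct Value"      # label assumed for illustration
    INCORRECT = "Incorrect Value"  # label as seen in the logs
    BLANK = "Blank"                # label assumed for illustration
    NOT_SENT = "NULL"              # label as seen in the logs


def classify_captcha(form: Mapping[str, str], expected: str,
                     field: str = "captcha") -> CaptchaOutcome:
    """Classify one submission by what happened to the CAPTCHA field."""
    answer = form.get(field)
    if answer is None:
        # The parameter is missing from the request altogether. A browser
        # always sends a text field, so this points to automation.
        return CaptchaOutcome.NOT_SENT
    if answer.strip() == "":
        # The parameter was sent, but the user left it empty.
        return CaptchaOutcome.BLANK
    if answer.strip().lower() == expected.strip().lower():
        return CaptchaOutcome.CORRECT
    return CaptchaOutcome.INCORRECT
```

The distinction that matters is between "sent but empty" and "absent from the request", which is why the lookup uses `form.get(field)` rather than defaulting to an empty string.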

The main aim was to measure the percentage of failed CAPTCHAs as an indication of how much the test annoys our users. It wasn't too bad. The logs also show some interesting behaviour in (alleged) spam attempts. Large numbers of incorrect, missing or null values are scattered through the CAPTCHA logs, but also very common are paired entries like this:

2008-09-26 12:49:49 | aaa.bbb.ccc.ddd | Incorrect Value
2008-09-26 12:49:51 | aaa.bbb.ccc.ddd | NULL

That is: an IP address sends a form submission and is rejected due to an incorrect value. Two seconds later, the same IP sends the form data again without the CAPTCHA data. Not empty, just not sent.
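
For anyone who wants to look for the same pattern in their own logs, here is a rough Python sketch that pairs an "Incorrect Value" entry with a "NULL" entry from the same IP shortly afterwards. The pipe-separated format is the one shown above; the function names and the five-second window are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, NamedTuple, Tuple


class LogEntry(NamedTuple):
    when: datetime
    ip: str
    outcome: str


def parse_line(line: str) -> LogEntry:
    """Parse 'YYYY-MM-DD HH:MM:SS | ip | outcome' into a LogEntry."""
    stamp, ip, outcome = (part.strip() for part in line.split("|", 2))
    return LogEntry(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), ip, outcome)


def find_retry_pairs(
    lines: Iterable[str],
    window: timedelta = timedelta(seconds=5),  # assumed threshold
) -> List[Tuple[LogEntry, LogEntry]]:
    """Return (incorrect, null) pairs from the same IP within the window."""
    entries = sorted(parse_line(line) for line in lines if line.strip())
    last_incorrect: Dict[str, LogEntry] = {}  # ip -> latest incorrect attempt
    pairs = []
    for entry in entries:
        if entry.outcome == "Incorrect Value":
            last_incorrect[entry.ip] = entry
        elif entry.outcome == "NULL":
            previous = last_incorrect.pop(entry.ip, None)
            if previous is not None and entry.when - previous.when <= window:
                pairs.append((previous, entry))
    return pairs


sample = [
    "2008-09-26 12:49:49 | aaa.bbb.ccc.ddd | Incorrect Value",
    "2008-09-26 12:49:51 | aaa.bbb.ccc.ddd | NULL",
]
retry_pairs = find_retry_pairs(sample)
```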

One thing that should also be pointed out: we send a clear 'message failed' response (403 Forbidden) when the CAPTCHA fails, since we don't want logs filled by spammers who aren't sure whether their message is getting through or not.

Back to the original point: the behaviour I have seen suggests to me that spammers are aware of the 'hidden field' trick. If they are able to send submissions as seen above and succeed 50% of the time, that is a huge success rate for almost no effort, compared to the computation and success rate of text recognition.

I hope this is enough to convince others that hidden fields are not a substitute for a real CAPTCHA, and that spam prevention is a real problem requiring stronger solutions.


Well, I know this might be a bit of an annoyance, but what about, instead of a math quiz that they'll eventually crack as well, a question based on the article, which the commenter is supposed to have read and understood anyway ;-)
What if two or more hidden CAPTCHA fields are used? Or how about one or more hidden CAPTCHA fields and one visible?
And a CAPTCHA is no substitute for real comment moderation.
Here's the thing. An intelligent web scraper is not going to base their form submissions on all the inputs on the page, but rather on the traffic they see going over the wire with HttpFox or Wireshark or what have you. I once wrote a small 'compiler' that took proxy logs and translated them directly into scraper directives. Much easier than screwing about parsing HTML or whatnot.
As someone who writes CAPTCHA crackers as part of my job (no, not for spamming), I can assure you that a hidden field would in no way trip me up. As another poster mentioned, I check the over-the-wire traffic, and don't pay much attention to what happens to be in the HTML of the form.
What about dynamically generated pages? If every page generated has unique names for all form fields, simply watching the traffic to the site is worthless. To keep it stateless you could add a hidden field containing an encrypted string that encodes which field is the "real" field.

For good measure you could also randomize the order of the sections in HTML, positioning everything with CSS.

Try to automate the scraping of a page like that without parsing the HTML and CSS. Of course just messing with the form names would prevent 99% of current bots. Not as fast as serving static pages, of course, but if you only use the dynamic generation on a "please prove you're human" page it should be fine.
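
A minimal Python sketch of the stateless, per-page field-name idea in the comment above. Note the swap: instead of an encrypted string naming the real field, this derives every field name from a per-page nonce with an HMAC, so the server can recompute the mapping without storing anything. The secret, the logical field names and the helper functions are all hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Mapping, Optional, Tuple

SECRET_KEY = b"replace-with-a-server-side-secret"  # assumption: never sent to clients
LOGICAL_FIELDS = ("name", "email", "comment", "captcha")  # hypothetical form fields


def field_name(nonce: str, logical: str) -> str:
    """Derive the per-page HTML name for one logical field."""
    digest = hmac.new(SECRET_KEY, f"{nonce}:{logical}".encode(), hashlib.sha256)
    return "f" + digest.hexdigest()[:16]


def new_form() -> Tuple[str, Dict[str, str]]:
    """Return a fresh nonce and the field names to render for this page."""
    nonce = secrets.token_hex(8)
    return nonce, {logical: field_name(nonce, logical) for logical in LOGICAL_FIELDS}


def decode_submission(form: Mapping[str, str]) -> Optional[Dict[str, str]]:
    """Map a posted form back to logical names, or None if it does not fit."""
    nonce = form.get("nonce")  # sent back in a hidden field
    if nonce is None:
        return None
    decoded = {}
    for logical in LOGICAL_FIELDS:
        key = field_name(nonce, logical)
        if key not in form:
            return None
        decoded[logical] = form[key]
    return decoded
```

On its own this only changes the names on each page load. A payload recorded once, nonce included, would still decode correctly on replay, so the nonce would also need an expiry or single-use check.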

I apologise for omitting it from the article, but hidden fields also fail against 'record and playback' attacks, similar to the 'over the wire' comments above. If a successful payload is submitted by a human, recorded and then replayed repeatedly, the hidden field is useless.
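
To make the playback problem concrete, here is a small Python sketch. The hidden-field name and the check itself are hypothetical, standing in for the kind of 'poison field' test described in the article.

```python
from typing import Mapping

HONEYPOT_FIELD = "website_url"  # hypothetical name of the CSS-hidden field


def passes_honeypot(form: Mapping[str, str]) -> bool:
    """Accept the submission only if the hidden field is absent or empty."""
    return form.get(HONEYPOT_FIELD, "") == ""


# A payload captured once from a real browser session. The human never saw
# the hidden field, so the browser sent it empty.
recorded = {"name": "A. Human", "comment": "Nice post", HONEYPOT_FIELD: ""}

# Replaying the recording with a different message body passes every time,
# because nothing in the check changes between requests.
replays = [dict(recorded, comment=f"spam #{i}") for i in range(3)]
results = [passes_honeypot(replay) for replay in replays]
```

Only a naive bot that fills in every field it finds in the HTML gets caught; anything working from recorded traffic never touches the hidden field at all.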
I think I know a way to make CAPTCHAs harder to auto-fill. Google Web Toolkit can handle object passing, although this requires that the server run Tomcat or GlassFish. I don't think people are yet able to debug how the objects are sent through AJAX using Google's method in GWT.

