Monday, April 12, 2010
Beware, Comment Spammers!
I had this great idea about how to fight comment spam. If you’re not familiar with comment spam, you probably don’t have your own blog and you think that “Kathryn” and “Patrick” who try to comment on this blog are just brain dead people. You might be right about the brain dead part, but I’m not sure they’re really people.
Do you ever wonder why commenting on blogs can be such a hassle, or why so many blogs require moderation, or why many blogs don’t accept comments on older posts, or forbid links in comments? It’s because of comment spam. Spammers will submit comments such as “Your post is helpful and informative” or “We need to pay attention to the eco friend environment” that don’t address the topic of the post in question. I’m not talking about targeted self-promotion here. It’s not comment spam to link to an article you wrote on a similar topic, but it’s definitely comment spam if you use a robot to do so, or if you hire people in Asian boiler rooms to get around the CAPTCHAs that stop your robots.
It used to be that comment spam was done to improve the search engine ranking of websites. That motivation has largely gone away with the development of the “nofollow” link attribute. Blogs such as “Go To Hellman” add rel="nofollow" to any links in the comment threads. This tells spidering robots not to follow the specified links and tells search engines to ignore the links for purposes of site ranking.
I guess the people who have been leaving spam comments on my blog didn’t get that memo. It’s annoying to have to delete the comments, especially the ones in Chinese where links get hidden around the periods in “...”. I went to the Blogger help pages to see if there’s any way to report the abusive commenters (this blog restricts anonymous comments, so there’s at least a user profile for every comment). There isn’t. What’s worse, Google tells you that if you don’t remove those spam comments, your site’s ranking will be hurt.
Then I had my bright idea. I clicked on one of the links left in the spam comment. Then I picked some keywords from the page and plugged them into Google to find the site. There, at the bottom of the search result, was an option: Dissatisfied? Help us improve. Google is asking for feedback. I pasted in the URL for my comment spammer’s site, and checked the radio button labeled “The results included spam.” I clicked send, and my spammer’s site was bound for Google oblivion!
Beware, comment spammers, I’m going to report you!
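For readers who haven’t seen it in action, here is roughly what adding rel="nofollow" to comment links amounts to. This is only a minimal Python sketch, not Blogger’s actual implementation: the function name `add_nofollow` and the regex-based approach are assumptions made for illustration, and a real blogging platform would more likely use a proper HTML parser or sanitizer.

```python
import re

# Matches an opening <a ...> tag and captures its attribute text.
A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Matches an existing rel="..." attribute (but not e.g. data-rel="...").
REL_ATTR = re.compile(r"""(?<![\w-])rel\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def add_nofollow(comment_html: str) -> str:
    """Return comment_html with rel="nofollow" on every link (hypothetical helper)."""

    def fix_tag(match: "re.Match[str]") -> str:
        attrs = match.group(1)
        rel = REL_ATTR.search(attrs)
        if rel is None:
            # No rel attribute yet: append one.
            return f'<a{attrs} rel="nofollow">'
        # A rel attribute exists: keep its values and add nofollow if missing.
        values = rel.group(2).split()
        if "nofollow" not in [v.lower() for v in values]:
            values.append("nofollow")
        joined = " ".join(values)
        return "<a" + attrs[:rel.start()] + f'rel="{joined}"' + attrs[rel.end():] + ">"

    return A_TAG.sub(fix_tag, comment_html)


if __name__ == "__main__":
    print(add_nofollow('Nice post! <a href="http://spam.example/">cheap templates</a>'))
```

A search engine that honors the attribute skips these links when it computes rankings, which is why the spam is supposed to be pointless.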
Though I felt good about it, I started to have doubts. A lot of these comment spammers seemed to be Asian; could it be that Asian search engines didn’t get the nofollow memo either? Some quick googling confirmed my suspicion: China’s leading search engine, Baidu, doesn’t pay attention to the nofollow attribute! These comment spammers must be using my blog to juice their Baidu ranking!
Well, maybe not. I did a few searches in Baidu. Baidu is probably the worst internet search engine I’ve ever tried! Baidu gives really stupid results for my vanity search. Baidu doesn’t index my blog, my website, or anything I’ve ever posted. Perhaps China has blacked out the entire Google network, including Blogger, and Baidu doesn’t see it any more. Or perhaps “Go To Hellman” has been banned for its post on Qin Shi Huangdi. Baidu has spidered a page from WorldCat that mentions some other Eric Hellman, and has picked up blog mentions of me by John Blyberg and in Dear Author, but not much else. It’s safe to assume that Baidu’s strength is not English-language indexing.
So if Baidu doesn’t index my blog, then spammers shouldn’t be able to improve their Baidu rankings with comment spam in my blog. There must be some other motivation for the comments.
Another thing I noticed is that Baidu seems to be big on searching for MP3s and PDFs. It ranks sites like Rapidshare rather highly. Maybe Baidu and similar search engines spider websites like my blog to discover the MP3 files, the PDFs, and the video files that Baidu users are really looking for, and the intended audience of the spam comments is these content spiders. My blog has discussed ebooks, piracy and related topics, so maybe the spammers think it’s a good source for links to content. Who knows?
Another possibility is that the spammers are trying to get bloggers themselves to visit their sites. “Patrick” from Madras is trying to sell “web templates”. It turns out that his site has copied content from another site marketing web templates, which appear to me to be copies of other websites with much of the content stripped out. It’s ironic: Patrick seems to be using a template for a web-template selling website to sell web templates.
After a few days, I checked back to see if the website I had complained about had been removed from Google or not. As it turns out, the site actually improved its Google ranking from #5 to #1 in my test search. So much for my career in comment spam scourgedom!
If you are a Comment Spammer, comments are closed to you.
I started to make a list of amusing spam comments like you mentioned (see it here: http://docs.google.com/View?id=dfr2jdcs_262gxmwmrfd)
I had 100's of spam comments on my blog every day. I noticed the vast majority were submitted for only one post, about ebooks (http://commonplace.net/2009/11/is-an-e-book-a-book/). Ever since I disabled comments for just that one post, I only get very few spam comments anymore.
I get a couple of comment spams a day on my blog. They're automatically detected and blocked, though. There are a few tricks you can use to detect bot behaviour. I wrote about the one I use on my website here:
https://secure.grepular.com/Blocking_Comment_Spam_Using_ModSecurity_and_Hidden_Fields
This method hasn't let a single bot comment through in months.
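The hidden-field trick mentioned in this comment can be sketched in a few lines of Python. To be clear, this is not the ModSecurity rule from the linked article; it is a generic "honeypot" check under stated assumptions. The field name `website_url` and the function `looks_like_bot` are hypothetical, and the form is assumed to contain an extra input that is hidden from human visitors with CSS, so only a bot that blindly fills in every field would put text in it.

```python
from typing import Mapping

# Hypothetical name of an input that the comment form hides from humans via CSS.
HONEYPOT_FIELD = "website_url"


def looks_like_bot(form: Mapping[str, str]) -> bool:
    """Return True if a submitted comment form trips the hidden-field check."""
    if HONEYPOT_FIELD not in form:
        # The real form always sends the field, so a submission without it
        # was probably posted directly by a script.
        return True
    # Humans never see the field, so any non-blank value suggests a bot.
    return form[HONEYPOT_FIELD].strip() != ""


if __name__ == "__main__":
    human = {"comment": "Interesting post!", "website_url": ""}
    bot = {"comment": "Your post is helpful and informative",
           "website_url": "http://spam.example/"}
    print(looks_like_bot(human), looks_like_bot(bot))
```

The appeal of this design is that legitimate commenters never notice it, unlike a CAPTCHA, though a bot written specifically for one site can learn to leave the field blank.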
I've noticed that comment spammers find their targets using search terms such as inurl:"node" intext:"post a comment" -"comments are closed" writing service. Thus, the invisibility incantation to the right –>
Stopping comment spammers is really easy: add a mod to your site that has a random text question a human user has to answer, as well as a CAPTCHA.
Secondly, analyze your log files and traffic by IP, and use Project Honey Pot to determine whether they are spammers. Then block those IPs or ranges at the server level. Linux/Apache users can easily use mod_rewrite and .htaccess. Windows/IIS users can simply deny access.
CAPTCHA is not so effective now. Spammers are using de-CAPTCHA software to get around it. I think the best option is question and answer, meaning asking a question whose answer can only be retrieved with a search engine.
Investing in Property
CAPTCHA should be more advanced now. There is software that can de-CAPTCHA those images. There should be some scrolling task or game that can only be played with a mouse; that would help to reduce spammers.
Property Investment