
Freaking smart ...

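garysu (retired moderator), posted 2012-2-21 21:38
This post was written or reposted by garysu and does not represent the position or views of this site. Copyright belongs to oursteps.com.au and the author garysu. Reposts must credit the author and the source, include this notice, and keep the content intact.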
http://www.npr.org/templates/story/story.php?storyId=93605988

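It turns out that when we type verification codes on some websites, we are actually working for someone else! Really clever ... You can skip straight to the part in red.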

People who use the Internet to talk to friends, set up free e-mail accounts or buy concert tickets are often unknowingly helping to digitize vast libraries of old books and newspapers.

That's because more than 40,000 Web sites — including popular ones such as Ticketmaster, Facebook and Craigslist — are using a new kind of security program called reCAPTCHA.

It's the brainchild of Luis von Ahn, a computer scientist at Carnegie Mellon University in Pittsburgh, who helped develop another commonly used Web security system. That one, called CAPTCHA, will allow people to access a Web site only if they prove they are human — and not a spammer's computer — by typing in a sequence of letters or numbers that appear on the screen in a distorted or garbled image.

"Each time you type one of these, your brain is doing something amazing," von Ahn says. "Your brain is performing a task that, despite 50 years of research in computer science, we cannot yet get computers to do."

The trouble is, each time you type in one of these garbled words, you're also wasting time. Von Ahn recently realized exactly how much time was being wasted, and he found it demoralizing.

"Approximately 200 million of these are typed every day by people around the world. Each time you type one of these, essentially you waste about 10 seconds of your time," he says. "If you multiply that by 200 million, you get that humanity as a whole is wasting around 500,000 hours every day, typing these annoying squiggly characters."

But with reCAPTCHA, von Ahn has come up with an idea for harnessing all that human brain power.

He knew that lots of libraries have huge efforts under way to digitize their collections. These projects first scan books or newspapers by basically taking a picture of each page. Then a computer takes the image of each word and converts it into text, using optical character-recognition software.

But computers often come across printed words they just can't recognize. "Especially for older documents, things that were written before 1900, where the ink has faded and the pages have yellowed out, the computer makes a lot of mistakes," says von Ahn.

A human being has to look at those words and decipher them. It occurred to von Ahn that he could link this kind of activity to security devices used on the Internet. Instead of asking people to prove they're human by copying random sequences of distorted letters and numbers, he could ask them to decipher mystery words from scanned books and newspapers.

So he got together with The New York Times, which is digitizing newspapers going back to 1851, and a nonprofit called the Internet Archive, which is digitizing thousands of books.

And now, if you go to someplace like Ticketmaster to buy, say, Jimmy Buffett tickets, you'll be shown images of not one but two distorted words.

One of these is the real security word: Type this one correctly and you're in. The other image is something that has mystified the digitizing software.

If people recognize that word, they type it in. This image will actually be shown to several people. If they all agree on what the word is, it will be considered accurately transcribed. And von Ahn says it will be incorporated into the digitized copy of the book or the newspaper that it came from.
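The agreement rule described here can be sketched in a few lines of Python. This is only an illustration, not reCAPTCHA's actual code: the function name `agreed_transcription`, the `min_readers` setting, and the case and whitespace normalization are all assumptions, since the article only says the image is shown to "several people".

```python
from typing import Optional, Sequence


def agreed_transcription(answers: Sequence[str], min_readers: int = 3) -> Optional[str]:
    """Return the agreed word if enough readers answered and all agree, else None.

    min_readers is a hypothetical setting; the article only says "several people".
    """
    # Normalize so that case and stray spaces do not count as disagreement
    # (an assumption made for this sketch).
    normalized = [a.strip().lower() for a in answers]
    if len(normalized) < min_readers:
        return None  # not enough independent readings yet
    if len(set(normalized)) != 1:
        return None  # readers disagree, so the word stays unresolved
    return normalized[0]


print(agreed_transcription(["Morning", "morning ", "morning"]))  # morning
print(agreed_transcription(["morning", "mourning", "morning"]))  # None
print(agreed_transcription(["morning", "morning"]))              # None
```

Requiring every reader to agree is the strictest possible rule; a single typo blocks the word until it has been shown to a fresh set of readers.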

"And the number of words that we've been able to digitize like this is insanely large, it's like over a billion. It's like 1.3 billion by now," von Ahn says.


In the journal Science, he and his colleagues report that over the last year Web users have transcribed enough text to fill up more than 17,600 books, with better than 99 percent accuracy.

Marc Frons, chief technology officer of digital operations for The Times, says the pace is astonishing. Each month, the project digitizes about two years' worth of newspapers.

"Next year, if all goes well, we can do as many as 70 years, which would be almost the entire rest of the archive that is not digitized," says Frons. "It's just pretty cool when you're signing up for a Web site and you see the reCAPTCHA sign. You sort of know, 'Gee, I'm helping digitize part of The New York Times.' "

People might wonder if this new system is wasting even more of their time than the traditional CAPTCHA setup, since it requires them to type in two different things instead of just one. But von Ahn says it's actually faster to type English words than to type random letters and numbers.

There is one problem. Sometimes, the book scanners offer up something that people can't read at all. "Like, for example, some sort of ink blot on the page," says von Ahn. "We might think it's a word and we present it, and you know, it says, 'Type the two words,' and sometimes one of the things is a word and one of the things is just a blob there. So sometimes people can be annoyed."

And here's another thing: "When you pull two random words from books, you can get some very random combinations," says Brian Pike, chief technology officer for Ticketmaster.

The two words can occasionally form juxtapositions that could be weird or offensive. "And there's certain phrases and words we've asked them to make sure don't show up," Pike says.

He declined to cite an example. Still, Pike says the system works great from a security standpoint. And if customers find it somewhat annoying, at least now they can know their time isn't being totally wasted.

A related write-up of this article, originally in Chinese:

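The reCAPTCHA project is a system developed at Carnegie Mellon University. Its main purpose is to use CAPTCHA technology to help digitize books and archives. The project takes words scanned from books that optical character recognition (OCR) could not recognize accurately and shows them in CAPTCHA challenges, so that people decipher them while answering the challenge. reCAPTCHA is digitizing the scanned archive of The New York Times; it has finished 20 years' worth of material so far and hopes to finish 110 years' worth in 2010. On September 17, 2009, Google announced that it was acquiring reCAPTCHA.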

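To verify that what people type is correct rather than random, two words are displayed: one is a known word whose answer is already established, and the other is an unknown word that OCR needs help with. Once a user has typed the known word correctly, the answer they gave for the unknown word is recorded and weighted by reCAPTCHA. When one particular reading of an unknown word has accumulated enough weight, the word counts as successfully recognized. OCR software has limited recognition ability, especially on old or damaged books with unclear print, whereas people can draw on their reading experience and easily recognize text that OCR cannot. For such text, human recognition accuracy can reach 99%, while OCR software reaches only 80%.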
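A minimal sketch of that weighting scheme, assuming each correct control-word answer adds a fixed weight to the user's reading of the unknown word. The class name `UnknownWord`, the default weight of 1.0, and the threshold of 3.0 are made up for illustration; the write-up above does not give the real values.

```python
from collections import defaultdict


class UnknownWord:
    """Collects weighted readings for one word that OCR could not recognize."""

    def __init__(self, threshold: float = 3.0):
        self.threshold = threshold          # hypothetical weight needed to accept
        self.weights = defaultdict(float)   # reading -> accumulated weight

    def submit(self, control_answer: str, control_truth: str,
               unknown_answer: str, weight: float = 1.0) -> bool:
        """Return True if the user passes the CAPTCHA.

        The user's reading of the unknown word is recorded only when the
        known (control) word was typed correctly.
        """
        if control_answer.strip().lower() != control_truth.lower():
            return False
        self.weights[unknown_answer.strip().lower()] += weight
        return True

    def accepted(self):
        """Return the reading whose weight reached the threshold, or None."""
        if not self.weights:
            return None
        reading, weight = max(self.weights.items(), key=lambda item: item[1])
        return reading if weight >= self.threshold else None


word = UnknownWord(threshold=3.0)
word.submit("overlooks", "overlooks", "inquiry")   # passes, reading counted
word.submit("overlook", "overlooks", "banana")     # fails, reading ignored
word.submit("overlooks", "overlooks", "inquiry")
print(word.accepted())   # None: only 2.0 of weight so far
word.submit("overlooks", "overlooks", "Inquiry")
print(word.accepted())   # inquiry
```

Because only the control word is checked, a user can pass while mistyping the unknown word; the threshold is what keeps such stray readings out of the final text.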

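The word images needed for a reCAPTCHA challenge are first fetched from the reCAPTCHA project's site through its JavaScript API. After the end user answers the challenge, the website's server connects back to the reCAPTCHA project's host to verify whether the user's input is correct. The reCAPTCHA project provides libraries for many programming languages to make integrating the service into existing programs easier. Unless a site has large bandwidth requirements, reCAPTCHA is in principle a free service.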
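The round trip described above (the page fetches a challenge, the user answers, the site's server asks the provider to verify) can be modeled with a toy in-memory example. Nothing here is the real reCAPTCHA API: `ChallengeService`, `new_challenge`, `verify`, and `handle_form_post` are hypothetical names, the provider is simulated in the same process instead of over HTTP, and the single-use behavior is an assumption of this sketch.

```python
import secrets


class ChallengeService:
    """Stand-in for the CAPTCHA provider's host (hypothetical, in-memory)."""

    def __init__(self):
        self._pending = {}  # challenge id -> expected answer

    def new_challenge(self, expected_answer: str) -> str:
        """Step 1: the page asks for a challenge and gets back an opaque id."""
        challenge_id = secrets.token_hex(8)
        self._pending[challenge_id] = expected_answer.strip().lower()
        return challenge_id

    def verify(self, challenge_id: str, response: str) -> bool:
        """Step 3: the site's server asks whether the response is correct.

        pop() makes each challenge single-use, so a replayed id fails.
        """
        expected = self._pending.pop(challenge_id, None)
        return expected is not None and response.strip().lower() == expected


def handle_form_post(service: ChallengeService, challenge_id: str, response: str) -> str:
    """Step 2: the site's server receives the form and defers to the provider."""
    return "accepted" if service.verify(challenge_id, response) else "rejected"


service = ChallengeService()
cid = service.new_challenge("overlooks")
print(handle_form_post(service, cid, "overlooks"))  # accepted
print(handle_form_post(service, cid, "overlooks"))  # rejected (already used)
```

The point of the design is that the website never learns the expected answer; it only forwards the challenge id and the user's response and trusts the provider's verdict.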

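Personally, I think the idea behind reCAPTCHA is great: on one hand it keeps out spam, and on the other it uses human brains for something computers cannot do. Two birds with one stone, very creative, and very useful.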

[ Last edited by garysu on 2012-2-21 22:21 ]

Lucifer, posted 2012-2-21 21:46
I have indeed noticed before that typing just one of the words was enough to pass verification.
iami, posted 2012-2-21 21:54

So cool!! With accuracy better than 99%, no one would complain about the performance of the system, and I'm pretty sure the last percent of error is more likely words that are unreadable even to humans.

I was wondering: after the text is put back, how do they do the semantic and syntax check? By real human beings or some sort of program?

黑山老妖 (retired moderator), posted 2012-2-21 21:56
Working for free ...

garysu (retired moderator), posted 2012-2-21 22:17
Originally posted by iami on 2012-2-21 21:54:
performance of the system, and I'm pretty sure the last percent of error is more likely words that are unreadable even to humans.

I was wondering: after the text is put back, how do they do the semantic and sy ...

This is even cooler than Apple ...

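garysu (retired moderator), posted 2012-2-21 22:20
Another thought after reading this story: most news sites in China failed to translate this article accurately, perhaps because the translators were unfamiliar with the subject. But the NPR article really is easy to follow, so even a translator from outside IT should have been able to translate it well.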