Social Computing

Vol. 3 No. 9 – November 2005

Articles

Fighting Spam with Reputation Systems

VIPUL VED PRAKASH and ADAM O’DONNELL, CLOUDMARK

Spam is everywhere, clogging the inboxes of e-mail users worldwide. Not only is it an annoyance, it erodes the productivity gains afforded by the advent of information technology. Workers plowing through hours of legitimate e-mail every day also must contend with removing a significant amount of illegitimate e-mail. Automated spam filters have dramatically reduced the amount of spam seen by the end users who employ them, but the amount of training required rivals the amount of time needed simply to delete the spam without the assistance of a filter.

Considering that spam essentially consists of a single unwanted message seen by a large number of individuals, there is no reason why the training load associated with an automated spam filter can’t be distributed across that community of individuals. The community comes together to jointly classify new messages as spam or not spam, and those decisions are then distributed back to its members.
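The community classification scheme described above can be sketched in a few lines of C. This is a hypothetical illustration, not Cloudmark's actual protocol: messages are reduced to a fingerprint (here a simple djb2 hash; a real system would use a fuzzy fingerprint that survives small mutations of the message), each user report counts as one vote, and a message is treated as spam once enough votes accumulate.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE     1024
#define SPAM_THRESHOLD 3    /* community votes needed before a message is called spam */

/* Toy fingerprint: djb2 hash of the message body. */
static uint32_t fingerprint(const char *msg) {
    uint32_t h = 5381;
    for (; *msg; msg++)
        h = h * 33 + (uint8_t)*msg;
    return h;
}

/* Shared tally of spam reports, keyed by fingerprint. */
static struct { uint32_t fp; int votes; } tally[TABLE_SIZE];
static int tally_len = 0;

/* A community member reports a message as spam: one vote per report. */
static void report_spam(const char *msg) {
    uint32_t fp = fingerprint(msg);
    for (int i = 0; i < tally_len; i++)
        if (tally[i].fp == fp) { tally[i].votes++; return; }
    if (tally_len < TABLE_SIZE) {
        tally[tally_len].fp = fp;
        tally[tally_len].votes = 1;
        tally_len++;
    }
}

/* Everyone else consults the shared tally before reading a message. */
static int is_spam(const char *msg) {
    uint32_t fp = fingerprint(msg);
    for (int i = 0; i < tally_len; i++)
        if (tally[i].fp == fp)
            return tally[i].votes >= SPAM_THRESHOLD;
    return 0;
}
```

In this sketch no single user's training effort is duplicated: once three members have reported a message, everyone else's `is_spam()` check catches it for free.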

Information Extraction: Distilling Structured Data from Unstructured Text

ANDREW McCALLUM, UNIVERSITY OF MASSACHUSETTS, AMHERST

In 2001 the U.S. Department of Labor was tasked with building a Web site that would help people find continuing education opportunities at community colleges, universities, and organizations across the country. The department wanted its Web site to support fielded Boolean searches over locations, dates, times, prerequisites, instructors, topic areas, and course descriptions. Ultimately it was also interested in mining its new database for patterns and educational trends. This was a major data-integration project, aiming to automatically gather detailed, structured information from tens of thousands of individual institutions every three months.

The first and biggest problem was that much of the data wasn’t available even in semi-structured form, much less normalized, structured form. Although some of the larger organizations had internal databases of their course listings, almost none of them had publicly available interfaces to their databases. The only universally available public interfaces were Web pages designed for human browsing. Unfortunately, but as expected, each organization used different text formatting. Some of these Web pages contained two-dimensional text tables; many others used a stylized collection of paragraphs for each course offering; still others had a single paragraph of English prose containing all the information about each course.

Social Bookmarking in the Enterprise

Can your organization benefit from social bookmarking tools?

DAVID MILLEN, JONATHAN FEINBERG, and BERNARD KERR, IBM

One of the greatest challenges facing people who use large information spaces is to remember and retrieve items that they have previously found and thought to be interesting. One approach to this problem is to allow individuals to save particular search strings to re-create the search in the future. Another approach has been to allow people to create personal collections of material—for example, the use of electronic citation bundles (called binders) in the ACM Digital Library. Collections of citations can be created manually by readers or through execution of (and alerting to) a saved search.

Perhaps the most familiar approach to “refinding” information on the Web has been through the use of personal bookmarks, supported by various Web browsers. For example, the Mozilla browser supports the creation of collections of URLs, which can be annotated using keywords or free-form text, and then sorted on a variety of dimensions (e.g., time last visited, keyword, location). An early study of bookmark use showed that people created bookmarks based on the quality and personal interest of the content, high frequency of current use, and a sense of potential for future use.1 Furthermore, the number of bookmarks contained in an individual collection grew steadily and roughly linearly, and the use of folders to categorize bookmarks increased as the size of the collection increased. A single level of folders was reported for collections with fewer than 300 bookmarks, whereas larger collections prompted multitiered hierarchies.
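The bookmark collection just described (URLs annotated with keywords, sortable on several dimensions) maps naturally onto a small data structure. A minimal sketch in C, with illustrative field names rather than Mozilla's actual storage format:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* A personal bookmark: a URL plus a keyword annotation and a last-visit
 * time, two of the dimensions the Mozilla example sorts on. */
struct bookmark {
    const char *url;
    const char *keyword;
    time_t      last_visited;
};

/* Most recently visited first. */
static int by_last_visited(const void *a, const void *b) {
    const struct bookmark *x = a, *y = b;
    return (y->last_visited > x->last_visited) - (y->last_visited < x->last_visited);
}

/* Alphabetical by keyword annotation. */
static int by_keyword(const void *a, const void *b) {
    const struct bookmark *x = a, *y = b;
    return strcmp(x->keyword, y->keyword);
}

/* Sort the collection on whichever dimension the user picks. */
static void sort_bookmarks(struct bookmark *bm, size_t n,
                           int (*dimension)(const void *, const void *)) {
    qsort(bm, n, sizeof *bm, dimension);
}
```

The flat array with pluggable comparators mirrors a single-level folder; the multitiered hierarchies the study reports for larger collections would add a parent-folder field on top of this.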

Curmudgeon

Stop Whining about Outsourcing!

David Patterson, ACM

I’m sick of hearing all the whining about how outsourcing is going to migrate all IT jobs to the country with the lowest wages.

The paranoia inspired by this domino theory of job migration causes American and West European programmers to worry about India, Indian programmers to worry about China, Chinese programmers to worry about the Czech Republic, and so on. Domino theorists must think all IT jobs will go to the Republic of Elbonia, the extremely poor, fourth-world, Eastern European country featured in the Dilbert comic strip.

Kode Vicious

Kode Vicious: The Doctor is In

KV is back on duty and ready to treat another koding illness: bad APIs. This is one of the most widespread pathologies affecting, and sometimes infecting, us all. But whether we write APIs or simply use APIs (or both), we would all do well to read on and heed the vicious one’s advice. And as always, your ongoing kode-related questions are welcomed and appreciated: kv@acmqueue.com.

Dear Kode Vicious,
I’ve been reading your rants for a few months now and was hoping you could read one of mine. It’s a pretty simple rant, actually: it’s just that I’m tired of hearing about buffer overflows and don’t understand why anyone in his or her right mind still uses strcpy(). Why does such an unsafe routine continue to exist at all? Why not just remove the thing from the library and force people to migrate their code? Another thing I wonder is, how did such an API come to exist in the first place?
Yours for Better APIs

by George Neville-Neil
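KV's full answer is in the column; the hazard the letter complains about is easy to show. A minimal sketch: strcpy() copies until it finds a NUL in the source and never consults the destination's size, whereas a bounded call such as snprintf() cannot overrun (strlcpy and strncpy are other common alternatives, each with its own caveats).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* strcpy() copies until it finds a NUL in the source; it never looks at
 * the destination's size, which is how buffer overflows happen. */
void copy_unsafely(char *dst, const char *src) {
    strcpy(dst, src);   /* overruns dst whenever strlen(src) >= dst's size */
}

/* One bounded alternative: snprintf() truncates instead of overrunning,
 * and always NUL-terminates the destination (when dstsize > 0). */
void copy_safely(char *dst, size_t dstsize, const char *src) {
    snprintf(dst, dstsize, "%s", src);
}
```

With an 8-byte buffer, copy_safely() stores at most 7 characters plus the terminating NUL; copy_unsafely() with the same oversized input is exactly the overflow the letter rants about.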

Articles

Threads without the Pain

Multithreaded programming need not be so angst-ridden.

ANDREAS GUSTAFSSON, ARANEUS INFORMATION SYSTEMS

Much of today’s software deals with multiple concurrent tasks. Web browsers support multiple concurrent HTTP connections, graphical user interfaces deal with multiple windows and input devices, and Web and DNS servers handle concurrent connections or transactions from large numbers of clients.

The number of concurrent tasks that need to be handled keeps increasing even as the software grows more complex. Structuring concurrent software in a way that meets these increasing scalability requirements, while remaining simple, structured, and safe enough to let mortal programmers construct ever more complex systems, is a major engineering challenge.

Interviews

A Conversation with Ray Ozzie

There are not many names bigger than Ray Ozzie's in computer programming. An industry visionary and pioneer in computer-supported cooperative work, he began his career as an electrical engineer but fairly quickly got into computer science and programming. He is the creator of IBM's Lotus Notes and is now chief technical officer of Microsoft, reporting to chief software architect Bill Gates. Recently, Ozzie's role as chief technical officer expanded as he assumed responsibility for the company's software-based services strategy across its three major divisions.
