Technology landscape

Bringing structure to unstructured data

A scan of the technology landscape identifies increased investments in technologies and standards that allow organizations to bring structure to unstructured data.

In interviews that OCLC staff conducted with 100 professionals actively engaged in the creation, management and dissemination of information, there was a clearly expressed interest in technologies and methods that will allow information professionals (and end users) to bring structure to the vast amount of unstructured data that is available in today’s Information Mall. Increased user interest in unstructured or uncataloged information such as historical photograph collections, audio clips, research notes, genealogy materials and other riches hidden in library special collections has ignited conversations about how best to create metadata and methods to ensure dynamic and meaningful links to and among these currently unstructured information objects.

This drive to bring structure to unstructured data is being spurred not only by the library and information community, but also by the business and government communities worldwide. It is estimated that 85 percent of the content in an enterprise is unstructured,3 and as enterprises look for new forms of competitive advantage, they are working to harness the power of this unstructured data.

Two dominant technical and structural approaches have emerged: a reliance on search technologies and a trend towards automated data categorization.

Search technologies

With the Web at 6 billion pages and growing, and organizational information page counts dwarfing that figure, finding what you want when you want it can be a daunting task. This problem has dominated the technology landscape in the last several years. The “killer app” solution is “search.”

Searching has become an international pastime. Over 625 million searches are conducted on the top eight search engines each day.4 Yet, even after five years of rapid growth, search engine technology is considered by many analysts to be in its early stages. The search engine arena is highly competitive, with nearly a hundred solutions on the market from companies ranging from upstarts like Endeca to the leaders Google, Yahoo! and Microsoft.

The following chart provides a brief overview of the top search technologies and sample vendors.

A survey of search technologies

| Search technologies5 | Definition | Sample vendors |
|---|---|---|
| Boolean (extended Boolean) | Retrieves documents based on the number of times the keywords appear in the text. | Virtually all search engines |
| Clustering | Dynamically creates “clusters” of documents grouped by similarity, usually based on a statistical analysis. | Autonomy, GammaSite, Vivisimo |
| Linguistic analysis (stemming, morphology, synonym-handling, spell-checking) | Dissects words using grammatical rules and statistics. Finds roots, alternate tenses, equivalent terms and likely misspellings. | Virtually all search engines |
| Natural language processing (named entity extraction, semantic analysis) | Uses grammatical rules to find and understand words in a particular category. More advanced approaches classify words by parts of speech to interpret their meaning. | Albert, Inxight Software, InQuira |
| Ontology (knowledge representation) | Formally describes the terms, concepts and interrelationships in a particular subject area. | Endeca, InQuira, iPhrase, Verity |
| Probabilistic (belief networks, inference networks, Naive Bayes) | Calculates the likelihood that the terms in a document refer to the same concept as the query. | Autonomy, Recommind, Microsoft |
| Taxonomy (categorization) | Establishes the hierarchical relationships between concepts and terms in a particular search area. | GammaSite, H5 Technologies, YellowBrix |
| Vector-based (support vector machine) | Represents documents and queries as arrows on a multidimensional graph and determines relevance based on their physical proximity in that space. | Convera, Google, Verity |

Source: Forrester Research, Inc.
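
To make the vector-based entry in the chart more concrete, the following is a minimal sketch of one common way to score documents against a query: a term-frequency vector space model with cosine similarity. It is an illustration only and does not describe how any vendor named in the chart works. The three sample documents, their titles and the function names are all hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Turn text into a sparse term-frequency vector (word -> count)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, title) pairs sorted from most to least similar."""
    query_vec = term_vector(query)
    scored = [
        (cosine_similarity(query_vec, term_vector(text)), title)
        for title, text in documents.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical three-document collection, invented for illustration.
DOCUMENTS = {
    "cold-war": "The Cold War arms race divided Berlin with the Berlin Wall",
    "genealogy": "Genealogy materials and historical photograph collections",
    "metadata": "Metadata and taxonomy bring structure to unstructured data",
}

if __name__ == "__main__":
    for score, title in rank("Berlin Wall", DOCUMENTS):
        print(f"{score:.3f}  {title}")
```

In this sketch a document that shares no terms with the query scores zero, which hints at why the chart pairs vector-based ranking with linguistic techniques such as stemming and synonym handling.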

One 2002 estimate suggests that Google search engines handle more questions in a day and a half than all the libraries in the U.S. provide in a year.6

There is little doubt that the rapid adoption of search technology has dramatically increased the power and productivity of the World Wide Web. Savvy Web users have become experts at maximizing search techniques to achieve the desired output but are also beginning to demand more sophisticated (or more structured) search methodologies. A group of high school students interviewed for this scan discussed how they have learned search techniques to find the information they need for school projects.

“[Search success] depends on how to do some of your searches. Because a lot of people say when they use search engines, they don’t find what they want but if you learn how to put your words in, you end up getting the results you want.” 7

—Marsadie, 16-year-old girl

“Yeah, with Google, you can search within your results. Like, you can type in like a general word that like, say your report is about like the Cold War. You can type in ‘Cold War’ and it will come up with a bunch of stuff, then you can narrow it down like you just go search within results and then type in ‘Berlin Wall,’ or ‘Soviet Union,’ or something like that…’Arms Race’…and then it will narrow it down and you can usually get better results that way.”7

—Catherine, 16-year-old girl

“… I actually tried doing research on a few different things but they came up invalid or just really not good. I found better information in just a regular book.”7

—James, 17-year-old boy

As users become more experienced and more discriminating, the shortcomings of current search solutions are surfacing. While many students had become very skilled at finding what they wanted, all focus group participants felt that easier search methods are needed. The experts agree. Three persistent difficulties are casting doubt on the long-term viability of today’s search techniques: finding known objects in huge search spaces, assembling top-down overviews that summarize the important points of a topic, and helping searchers decide what they really want when their initial search ideas are confused, misguided or ambiguous.8

Several technology analysts surveyed for this scan said that using today’s search technologies was simply “using brute force to solve the data discovery problem.” Search (or search alone) is not the long-term answer for superior information discovery.

Automatic data categorization—enabling the smarter “find”

Several data organization and description technologies and methodologies are gaining popularity as ways to address these shortcomings. Data organization techniques that library science has utilized for decades are becoming popular and important outside the information management community.

“The demand, outside the library community, for information about data organization and metadata is exploding,”9 say Gartner, Inc. technology analysts. In 2003 Gartner issued several research notes on metadata, including “Enterprises Need a Metadata Integration Strategy”10 and “Taxonomy Creation: Bringing Order to Complexity.”11

Many data categorization techniques are being applied across the landscape, including taxonomies, semantics, natural-language recognition, auto-categorization, “what’s related” functionality, data visualization, personalization and more. All of these techniques aim to help searchers find what they really want.

Data categorization is not new. “At one time, researchers speculated that solving such search problems might require artificial intelligence: systems that simulated human thought and could behave like skilled reference librarians. […] Until recently, however, IT applications required paid humans to think up the category names, define their relationships and write the rules that channeled data into the proper boxes. As a result, the technique was limited to fields with big budgets, such as financial analysis or defense. During the past few years, however, technology development has made it much easier to automate or at least semiautomate categorization.”12 Data categorization techniques are moving from manual activities, done by librarians and other information professionals, to automated processes executed on behalf of users.
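
The shift from hand-written rules to automated categorization can be illustrated with a small sketch. The example below assumes a Naive Bayes approach (one of the probabilistic techniques listed in the chart earlier in this section) in which categories are learned from a handful of labeled examples rather than from rules written by people. The class name, the two categories and the four training sentences are hypothetical and chosen only for illustration; production categorization tools are considerably more elaborate.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCategorizer:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # category -> training documents seen
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.vocabulary = set()

    def train(self, text, category):
        """Record one labeled example."""
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocabulary.add(word)

    def classify(self, text):
        """Return the most probable category, or None if nothing was trained."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical labeled examples standing in for a real training collection.
TRAINING = [
    ("quarterly earnings and stock market analysis", "finance"),
    ("bond yields and market interest rates", "finance"),
    ("family tree and birth records", "genealogy"),
    ("census records and family ancestry research", "genealogy"),
]

categorizer = NaiveBayesCategorizer()
for text, category in TRAINING:
    categorizer.train(text, category)

if __name__ == "__main__":
    print(categorizer.classify("stock market rates"))
    print(categorizer.classify("family birth records"))
```

The design point is that adding a category means supplying more labeled examples, not writing more rules, which is the cost shift the quoted passage describes.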

“More and more information travels with a lengthening entourage of data about itself. Autocategorization software recognizes and leverages that data.”12 Information professionals have an opportunity to leverage these new technologies to bring information management methods to a large portion of today’s born-digital content.
