Mining the Deep Web
Michael Hunter, Reference Librarian, Hobart and William Smith Colleges
For Western New York Library Resources Council Member Libraries' Staff
Sponsored by the Western New York Library Resources Council

For today . . .
- From Web to Deep Web
- Search Services: Genres and Differences
- The Topography of the Internet
- Mining the Deep Web: Techniques and Tips
- Hands-on Session
- Evaluating Deep Web Resources
- Using Proprietary Software

Web to Deep Web
- 1991: Gopher
  - Menu-based, text only
  - You had to KNOW the sites
- 1992: Veronica
  - Menus of menus
  - Difficult to access

Web to Deep Web
- 1991: Hyper-Text Markup Language
  - Linkage capability leads you to related information elsewhere

- Classic Web Site
  - Relatively stable content of static, separate documents or files
  - Typically no larger than 1,000 documents navigated via static directory structures

Web to Deep Web
- 1994: Lycos launched
  - First crawler-based search engine, with a database of 54,000 html documents (CMU)
- Growth of html documents unprecedented and unanticipated
- 2000 (April): The Web is doubling in size every 8 months (FAST)

Web to Deep Web
- 1996: Three phenomena pivotal for the development of the Deep Web:
  - HTML-based database technology introduced (Bluestone's Sapphire/Web, Oracle)
  - Commercialization of the Web: growth of home PC users and e-commerce
  - Web servers adapted to embrace dynamic serving of data (Microsoft's ASP, Unix PHP and others)

Web to Deep Web
- 1998: Deep Web comes of age
  - Larger sites redesigned with a database orientation rather than static directory structure
    - U.S. Bureau of the Census
    - Securities and Exchange Commission
    - Patent and Trademark Office

Search Services: Genres and Differences
- Exclusively crawler-created
  - Search engines
  - Meta search engines
- Human created and/or influenced
  - Directories
  - Specialized search engines
  - Subject metasites
  - Deep Web gateway sites



[Diagram: crawlers (CR), web server (WS), search engine database and users]

Search Services: Exclusively Crawler Created
- Database compiled through automated, link-dependent crawling and site submission
- Unable to access:
  - Dynamically-created pages
  - Proprietary, non-html filetypes
  - Multimedia
  - Software
  - Password-protected sites
  - Sites prohibiting crawlers (robots.txt exclusion)

Dynamically-created Web pages
- Created at the moment of the query using the most recent version of the database
- Database-driven
- Require interaction
- Example: Amazon.com
  - What titles are available?
  - At what price?
  - Are there recent reviews?
  - What about shipping?
- Used widely in e-commerce, news, statistical and other time-sensitive sites

Dynamically-created Web pages
- Why can't crawlers download them?
  - Technically they can interact, within limits of programming capability
  - Very costly and time-consuming for general search services

Dynamically-created Web pages
- How can a crawler detect a dynamically-created page?
- From any of the following in the URL: ? , % , $ , = , ASP , PHP , CFM and others
- Example: proquest.umi.com/pqdweb?Did=000000209668731&Fmt=1&Deli=1&Mtd=1&Idx=5&Sid=1&RQT=309
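The URL test described above can be sketched in a few lines of Python. This is only an illustration: the function name `looks_dynamic` and the exact matching rules (any of the listed characters anywhere in the URL, or a path ending in one of the listed extensions) are assumptions, not a description of how any particular crawler works.

```python
from urllib.parse import urlsplit

# Markers taken from the slide; how they are matched is an assumption.
DYNAMIC_CHARS = ("?", "%", "$", "=")
DYNAMIC_EXTENSIONS = (".asp", ".php", ".cfm")


def looks_dynamic(url: str) -> bool:
    """Return True if the URL carries a marker of a dynamically created page."""
    # Any of the special characters anywhere in the URL counts as a marker.
    if any(ch in url for ch in DYNAMIC_CHARS):
        return True
    # Prefix "//" to scheme-less URLs so urlsplit separates host from path.
    parts = urlsplit(url if "://" in url else "//" + url)
    return parts.path.lower().endswith(DYNAMIC_EXTENSIONS)


if __name__ == "__main__":
    print(looks_dynamic("proquest.umi.com/pqdweb?Did=000000209668731&Fmt=1"))
    print(looks_dynamic("http://example.com/about/index.html"))
```

A crawler applying a rule like this would skip the ProQuest URL from the slide and keep a plain static address such as the hypothetical `example.com` page.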

Proprietary Filetypes
- PDF
- Spreadsheets
- Word-processed documents
- Google does it! Why can't you?

Google's Deep Web Components: Non-html filetypes (1.75%)
- SEARCH SYNTAX: california power shortage filetype:pdf
- Adobe Portable Document Format (pdf)
- Microsoft Excel (xls)
- Adobe PostScript (ps)
- Microsoft PowerPoint (ppt)
- Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wk
- Microsoft Word (doc)
- Lotus WordPro (lwp)
- MacWrite (mw)
- Microsoft Works (wks, wps, wdb)
- Microsoft Write (wri)
- Rich Text Format (rtf)
- Text (ans, txt)
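The search syntax above is simply the subject terms followed by `filetype:` and an extension. A minimal Python sketch of building such a query string follows; the helper name `filetype_query` is hypothetical, and the extension set contains only the extensions that are fully legible in the list above.

```python
# Extensions copied from the slide's list; the truncated Lotus entry is
# included only as far as it is legible.
LISTED_EXTENSIONS = {
    "pdf", "xls", "ps", "ppt", "doc", "lwp", "mw",
    "wks", "wps", "wdb", "wri", "rtf", "ans", "txt",
    "wk1", "wk2", "wk3", "wk4", "wk5", "wki",
}


def filetype_query(terms: str, extension: str) -> str:
    """Build a query of the form '<terms> filetype:<ext>'."""
    ext = extension.lower().lstrip(".")
    if ext not in LISTED_EXTENSIONS:
        raise ValueError(f"extension not in the listed filetypes: {extension}")
    return f"{terms.strip()} filetype:{ext}"


if __name__ == "__main__":
    print(filetype_query("california power shortage", "pdf"))
```

The resulting string is what you would type into the search box; the function does not contact any search service.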

Google Non-html Filetypes
- Warning! FOR NON-HTML FILES
  - Clicking on a title in the results list opens the application as well, involving risk of a virus or worm that may be attached to the file
  - INSTEAD, click the View as HTML option; no applications will be opened and there is no risk of virus or worm
- NOTE: Titles for non-html files are frequently not descriptive of content
- Example: homeland security filetype:ppt

Search Services: Human created or influenced
- Directories, general and specialized
- Specialized search engines
- Subject metasites or gateways
- Deep Web gateways

Search Services: Human created or influenced
- Content of sites is examined and categorized, or crawling is human-focused and refined
- CAN include sites with dynamically created pages
- CAN be limited to database-driven sites (Deep Web)
- CAN include non-html files
- NOTE: Some specialized search engines may include little human influence, e.g. Search.edu

The Topography of the Internet or The Layers of the Web

Mapping the web is challenging
- Unregulated in nature
- Influences from all over the globe
- Fulfills many purposes, from personal to commercial
- Changes rapidly and unexpectedly
- Divisions and terminology are inherently ambiguous, e.g. Deep vs Invisible Web

May I suggest a biological, nautical metaphor, perhaps the ocean?
- SURFACE WEB
- SHALLOW WEB
- OPAQUE WEB
- DEEP WEB

Surface Web
- Static html documents
- Crawler-accessible

Shallow Web
- Static html documents loaded on servers that use ColdFusion, Lotus Domino or other similar software
- A different URL for the same page is created each time it is served
- Crawlers skip these to avoid multiple copies of the same page in their database
- Technically human accessible via directories, Deep Web gateways or links from other sites

Opaque Web
- Static html documents
- Technically crawler accessible
- 2 types:
  - Downloaded and indexed by crawler
  - Not downloaded or indexed by crawler

Opaque Web
- Downloaded and indexed by crawler
  - Buried in search results you never look at
  - A casualty of relevance ranking
- Not downloaded or indexed by crawler due to programmed download limits
  - Document buried deep in the site
  - Part of a large document that did not get downloaded (typical crawl per page is 110 K or less)
  - Document added since last crawler visit (even the best revisit on an average of every 2 weeks, depending on amount of change at a site)

Opaque Web
- Access to the Opaque Web
  - Specialized search engines
  - General and specialized directories
  - Subject metasites
- These services typically index more thoroughly and more often than large, general search engines

Deep Web
- Two Categories
  - Technically inaccessible to crawlers
  - Technically accessible to crawlers

Deep Web
- Technically inaccessible to crawlers
  - Dynamically created pages
  - Databases
  - Non-textual files
  - Password protected sites
  - Sites prohibiting crawlers

Deep Web
- Technically accessible to crawlers
  - Textual files in non-html formats (Google does it!)
  - Pages excluded from crawler by editorial policy or bias

Mining the Deep Web: Techniques and Tips

How large is the Deep Web?
- White Paper by Michael K. Bergman published in the Journal of Electronic Publishing in 2000.
- http://www.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp

- Currently a scarcity of unbiased research due to its fluid nature, dynamic content and multiple points of access

How large is the Deep Web? Bergman Study
- Over 150,000 databases
- Over 95% publicly available
- Perhaps 500 times larger than the Surface Web
- Growth rate currently greater than the Surface Web

What's in the Deep Web?
- Information likely to be stored in a database
  - People, address, phone number locators
  - Patents
  - Laws
  - Dictionary definitions
  - Items for sale or auction
  - Technical reports
  - Other specialized data

What's in the Deep Web?
- Information that is new and dynamically changing
  - News
  - Job postings
  - Travel schedules and prices
  - Financial data
  - Library catalogs and databases
- Topical coverage is extremely varied.

Mining the Deep Web
- A world different from search engines . . .

Hunter's Maxim for Searching the Deep Web
- Plan to first locate the category of information you want, then browse.
- Don't be too specific in your searches. Cast a wide net.
- Brush up on your Gopher-type search skills (if you were searching the Net back then).
- We've become accustomed to search engine free-text searching. This is a different world.

Basic Strategies for Mining the Deep Web
- Using directories, general and specialized
- Using general search engines
- Using specialized (subject-focused) search engines
- Using subject metasites (link-oriented)
- Using Deep Web gateway sites (database-oriented)
- NOTE: Many sites contain elements of all of the above, in varying degrees and combinations

Using directories
- Yahoo! > web directories > 840 category matches
- Yahoo! > database > 22 categories and 7423 site matches
- Google Directory > link collections > 493,000
- Databases may also be found under general subject categories
- Also use research directories such as Infomine, LII, WWWVL and others

Using general search engines
- Combine subject terms with one or more of these possibilities:
  - directory
  - crawler
  - search engine
  - database
  - webring or web ring
  - link collection
  - blog
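As a rough illustration of the strategy above, the following Python sketch pairs a subject with each of the listed resource-type words to produce candidate queries. The function name `deep_web_queries` is hypothetical, and treating "webring" and "web ring" as two separate variants is an assumption.

```python
# Resource-type words from the slide; "webring" and "web ring" are kept
# as separate variants, which is an assumption about how to combine them.
RESOURCE_TERMS = [
    "directory", "crawler", "search engine", "database",
    "webring", "web ring", "link collection", "blog",
]


def deep_web_queries(subject: str, resource_terms=RESOURCE_TERMS) -> list[str]:
    """Pair a subject with each resource-type word to form candidate queries."""
    subject = subject.strip()
    return [f"{subject} {term}" for term in resource_terms]


if __name__ == "__main__":
    for query in deep_web_queries("toxic chemicals"):
        print(query)
```

Each generated string is meant to be tried by hand in a general search engine; since the engine must match words on the page, some combinations will work better than others for a given subject.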

Using general search engines
- Google (11/4/02)
  - toxic chemicals database > 45
  - punk rock search engine > 77
  - science fiction webring > 97 (web rings are cooperative subject metasites, maintained by experts or aficionados)
- Remember, when using a search engine you must match words on the page.

Using specialized (subject-focused) search engines
- AKA
  - Limited-area engines
  - Targeted search engines
  - Expert search services
  - Vertical Portals
  - Vortals

Using specialized (subject-focused) search engines
- Non-html textual files
  - http://searchpdf.adobe.com/
  - Google
- Non-textual files
  - Image, MP3 search engines
  - Media search at Google, et al.
- Software
- Blogs
  - Blogdex http://blogdex.media.mit.edu/

Web logs or blogs
- Online personal journals
- Postings are often centered around a particular topic or issue and may contain links to recent relevant information
- Frequently updated
- Differ from newsgroups in that they are generally by one author

Web logs or blogs
- How do you search them?
  - Blogdex http://blogdex.media.mit.edu
  - Open Directory http://dmoz.org (Computers / Internet / On the Web / Weblogs)

- Are they part of the Deep Web? Yes and No

Web logs or blogs
- Google (5/23/02 and 11/4/02)
  - allinurl:blogspot > 171,000 | 301,000 (53%; mostly blog home pages)
  - allinurl:oxblog > 2 | 39 (1900%; home page and 1 posting)
- FAST (5/23/02 and 11/4/02)
  - URL:blogspot > 355,671 | 2,434,871 (146%; mostly blog home pages)
  - URL:oxblog > 0 | 5,510
- Start your own at http://blogspot.com

Using subject metasites (link-oriented)
- Locate subject metasites via
  - Directories
  - Professional organizations' home pages
  - Specialized search engine gateways (handout)
  - Colleagues/Researchers
- Once into a subject metasite, scan the page for search boxes and determine if they search the surface web of the site only or embedded databases. (This is often not clearly indicated.)

Using Deep Web gateway sites (database-oriented)
- Become familiar with several (see handout)
- Most search only the home pages of the databases they include. A few will actually enter your search terms and display results.
- Explore their subject areas; some subjects may not be included at all.
- Deep Web gateways are still in an early stage of development, seeking broad appeal rather than a narrow focus.

Using serendipity
- Sometimes the Deep Web comes to you!

- Mine your bookmarks/favorites and add Deep Web resources when you come across them by chance.

Evaluating Information from the Deep Web

Evaluating Deep Web Information
- Embedded databases
- Non-html textual files and password protected sites
- Non-textual files
- Software

Embedded Databases
- Typically targeted, focused information
- Content usually generated and used by knowledgeable parties
- Database creation and maintenance requires expertise and commitment
- Site location is usually stable

Embedded Databases
- Check author and/or sponsor
- Check for freshness
- Check for breadth or range of coverage
- Compare with other Deep Web sources offering similar information, especially for online shopping or other e-commerce uses.

Non-html textual files and password protected sites
- Evaluate as you would any other information from the Internet
- BEWARE: If using Google, open non-html textual files as html when possible. Opening the file and its application may transmit a virus.

Image, audio, multimedia files
- Check for image/audio quality
- Check for plug-in requirements
- Check for depth of coverage in the area of your query
- FEE or FREE???

Software
- Check for sponsor/source/maintainer
  - Is there a contact person?
- Check for freshness
  - Latest versions available?
- Check for stability and reliability
  - Has any virus scanning been done?
- Check for breadth
  - Are programs available for all operating systems?
- FEE or FREE???

Mining the Deep Web with Proprietary Software

Directed Query Engines or Intelligent Agents

- Designed to access distributed Deep Web resources
- Can be configured to search specific URLs
  - Databases
  - Subject metasites
  - Report collections
  - Dynamic pages
  - Online newsletters

Directed Query Engines or Intelligent Agents
- Several DQEs can be nested: one query launches several others in a cascading fashion
- Publicly-available examples:
  - PubMed
  - Department of Energy's Information Bridge
  - NASA's Technical Report Server
  - Apple's Sherlock (bundled with Mac OS 8.5 or higher): searches Deep Web databases that you specify

Directed Query Engines for purchase
- Simultaneous search of Deep Web and other resources with many additional features
- Lexibot http://www.lexibot.com
  - If you complete survey: $189, upgrades $15
  - If you don't: $289, upgrades $50
- BullsEye http://info.intelliseek.com
  - BullsEye Pro: 6 months $199 with free upgrades for

How does the Deep Web fit into my overall search strategy?

What types of queries are well-suited to the Deep Web?
- Information stored in databases
  - One of many similar things
  - Statistics, census data
  - City, county, state, national and international public records, data and laws
  - Online reference books

What types of queries are well-suited to the Deep Web?
- Information that is new and dynamically changing
  - News
  - Pricing and availability of goods and services
  - Financial data, national and international
  - Job postings
  - Travel schedules and pricing
  - Library catalogs and databases

What types of queries are well-suited to the Deep Web?
- Non-html textual files
- Non-textual files
- Software
- Searching blogs

A few words from Sherman and Price
- Authors of The Invisible Web (Cyber Age Books, 2000)
- Datamine your Bookmark/Favorites Collection
- Explore reviewed sites thoroughly; they often contain Deep Web resources not mentioned by the reviewer
- Subscribe to lists that are focused and relevant to your needs
  - No main Deep Web list exists
  - Resources appear in subject-based lists

A few words from Sherman and Price
- Create your own monitoring service
  - Identify What's New pages and key sites you find valuable
  - Use C4U to alert you to changes at these sites
    - Gives you the type of change and keywords from the new text
    - Enables you to determine whether it's worth checking or not
    - Available FREE at http://www.c4u.com

Remember Hunter's Maxim for the Deep Web
- Plan to first locate the category of information you want, then browse.
- Don't be too specific in your searches. Cast a wide net.

Thank you and best of luck in discovering and taming this new Cyber Frontier!!!

Michael Hunter
Reference Librarian
Warren Hunting Smith Library
Hobart and William Smith Colleges
Geneva, NY 14456
(315) 781-3552
[email protected]

