www2003 day 1

May 21st, 2003  |  Published in events  |  2 Comments

I’m at www2003 in Budapest, Hungary for the rest of the week. I’m not much of a note-taker but I’ll probably jot a few things down here. Good wireless coverage in the conference venue. See also day 2.


09.00 Breakfast, registration, Hungarian Informatics and Communications minister, violinists with cheesy thumping synth backing track.

09.30 Tim Berners-Lee takes the stage. How are Web Services and Semantic Web related? A story of program and data as old as computing. Theme: “The data in our lives is a web”, the RDF model and URIs as identifiers are descriptive, robust and reusable. Web Services can use these for discovery and description, RDF can be sent as a SOAP payload, etc. Emphasis on Enterprise Integration (euw). Ends with “OH MY GOD! THE ENTIRE WORLD IS ON MY LAPTOP!” really loud.

Information Retrieval

11.00 Information Retrieval session, starting with “Query-Free News Search” from Google.
(Bah, I find a power socket and then the room loses power to its plug strips. Puzzled people jiggle their cables all over the room.)
Generating related news by scraping TV closed-captions. “Issue query every s seconds” based on term extraction from recent text segment, data structure from previous text. Rare words given higher weight (Inverse Document Frequency squared). Use stemming, but not Porter algorithm or similar – just take first 5 chars of word. Build a stem vector over time, adding information if new text segment seems similar to old, building new vectors if the text segment is significantly different. Try to issue 3-word queries – 1-word too vague, fall back to 2-word if necessary.
Postprocessing has steps of boosting based on top terms and similarity to given text, then filter to top 2 results.
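The term-selection step described above might look roughly like this – a minimal sketch, with a made-up IDF table and sample caption text, and assuming weights are term frequency times IDF squared as described:

```python
from collections import Counter

def stem(word):
    # The paper's cheap stemmer: no Porter, just the first 5 characters.
    return word.lower()[:5]

def build_query(segment, idf, n_terms=3):
    """Pick the top-weighted stems from a caption segment.
    idf maps stem -> inverse document frequency (assumed precomputed)."""
    counts = Counter(stem(w) for w in segment.split())
    # Rare words get higher weight: in-segment frequency times IDF squared.
    weights = {s: tf * idf.get(s, 1.0) ** 2 for s, tf in counts.items()}
    top = sorted(weights, key=weights.get, reverse=True)[:n_terms]
    return " ".join(top)

# Hypothetical IDF values for illustration only.
idf = {"anthr": 9.0, "flu": 7.5, "outbr": 6.0, "the": 1.0}
print(build_query("the anthrax outbreak and the flu", idf))  # → anthr flu outbr
```

Falling back from a 3-word to a 2-word query would just mean retrying with `n_terms=2` when the 3-word query returns nothing.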
Data sets: three days of 30-minute CNN Headline News broadcasts, plus 90 minutes of other CNN programming.
Search precision is improved by issuing queries based on longer (15-second) segments of text. The boosting and filtering steps make a big improvement (0.6 versus 0.9).
They define recall for the web (“pooled recall”) as (# of relevant pages found) / (total # of relevant pages found by any Google algorithm), since actual recall is unknowable on the web. Some introductory material on recall and precision.
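That pooled-recall definition is simple enough to write down directly (toy result sets here, just for illustration):

```python
def pooled_recall(found, pool):
    """Pooled recall: relevant pages this algorithm found, divided by the
    relevant pages found by *any* algorithm under test -- a stand-in for
    true recall, which can't be measured on the whole web."""
    return len(set(found) & set(pool)) / len(set(pool))

# Toy example: the pool is the union of relevant results from all algorithms.
pool = {"a", "b", "c", "d"}
print(pooled_recall({"a", "c"}, pool))  # → 0.5
```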
If we’re only going to show the top 2 results, it’s very effective to discard one top result when it is very similar to the other, and pick the next result instead.
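That near-duplicate filter is easy to sketch. The paper’s actual similarity measure isn’t in my notes, so word-overlap (Jaccard) and the 0.6 threshold here are stand-ins:

```python
def jaccard(a, b):
    # Stand-in similarity: word overlap between two result titles.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def top_two(results, similarity, threshold=0.6):
    """Take the best result, then the next one that isn't a near-duplicate."""
    best, rest = results[0], results[1:]
    for r in rest:
        if similarity(best, r) < threshold:
            return [best, r]
    return [best]

hits = ["cnn anthrax coverage", "cnn anthrax coverage live", "flu season update"]
print(top_two(hits, jaccard))  # → ['cnn anthrax coverage', 'flu season update']
```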
Open problems for future work: how to combine algorithms? Weighted term-vector retrieval? Extend to other genres, not just news. Apply to other text streams, eg conversations.
This is google research, not deployed.

11.30 “Improving pseudo-relevance feedback in web information retrieval using webpage segmentation” presented on video. Microsoft Research Asia. SARS prevents attendance.
Separating out bits of webpages (nav, branding, content, multiple topics within content) for IR. (Non-native English speaker and noisy audio make understanding a bit of a struggle.) “Goal: construct a vision-based content structure for a web page”. Previous DOM-based work “does not necessarily reflect semantic partition”. The VIPS algorithm tries to extract visual blocks and find separators: horizontal/vertical lines, space with no blocks crossing. (Video occasionally jerky; presenter appears as-if-by-magic pointing at slides from time to time, Mr Benn style.) This extraction is done after an initial traditional IR search, iterating to improve results by going down to sub-page-segment level to measure the quality of documents from the initial result set.
Experiments used TREC wt10g corpus (non-free dataset).

12.00 “Predictive Caching and Prefetching of Search Results in Web Search Engines” IBM Research
Exploiting locality of reference in query streams faced by search engines: popularity of searches (zeitgeist), multiple result pages per query.
Based on stats extracted from query logs (7,160,190 queries to AltaVista in September 2001). A query is a triplet (topic/search phrase, first page requested, last page requested). 97.7% of queries just requested the first page of results. 7,160,190 queries requested 7,549,873 result pages. 67% of the 2,657,410 topics were requested only once; the most popular topic was requested 31,546 times. (Presumably a Zipf curve.) Yes, follows a power-law distribution.
Caching schemes: LRU. Segmented LRU – protected and probationary segments with separate expiry. Pages requested twice get moved into the protected segment. Previous work suggests SLRU outperforms LRU until the cache gets large, when there’s not much difference between the two algorithms. PDC – probability-driven cache: a simple model of a search session that helps prefetching by predicting the user’s next hit (most likely being the next page in the result set for an existing topic query).
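The SLRU scheme as described can be sketched in a few lines. Segment sizes are illustrative, and the choice to demote an overflowing protected entry back to probationary (rather than drop it) is my assumption about the usual SLRU behaviour, not something from the talk:

```python
from collections import OrderedDict

class SLRU:
    """Segmented LRU sketch: new entries enter a probationary segment;
    a second access promotes them to the protected segment."""

    def __init__(self, prob_size=4, prot_size=4):
        self.prob = OrderedDict()   # probationary segment, LRU order
        self.prot = OrderedDict()   # protected segment, LRU order
        self.prob_size, self.prot_size = prob_size, prot_size

    def get(self, key):
        if key in self.prot:
            self.prot.move_to_end(key)
            return self.prot[key]
        if key in self.prob:
            # Second hit: promote into the protected segment.
            val = self.prob.pop(key)
            self._put_protected(key, val)
            return val
        return None

    def put(self, key, val):
        if key in self.prot:
            self.prot[key] = val
            self.prot.move_to_end(key)
        elif key in self.prob:
            self.prob.pop(key)
            self._put_protected(key, val)   # second access counts as a hit
        else:
            self.prob[key] = val
            if len(self.prob) > self.prob_size:
                self.prob.popitem(last=False)  # evict LRU probationary entry

    def _put_protected(self, key, val):
        self.prot[key] = val
        if len(self.prot) > self.prot_size:
            # Assumed behaviour: demote protected LRU back to probationary.
            k, v = self.prot.popitem(last=False)
            self.prob[k] = v
            if len(self.prob) > self.prob_size:
                self.prob.popitem(last=False)
```

For a search engine the keys would be (topic, result page) pairs, which is also what makes PDC-style prefetching of page n+1 cheap to bolt on.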

Phew, lunchtime.

Semantic Web panel

14.00 “The Semantic Web: Scientific American article considered harmful?” – panel session with Jim Hendler, Amit Sheth, Jane Hunter, Nigel Shadbolt and Ramanathan Guha
The article in question
Jim Hendler’s slides – he’s up first, gives a quick demo of running system showing the use case from the article. Now talking semweb advocacy.
This is fairly free-form and I’m squeezed into a corner – there’s some scribing of the session being logged from #www2003.
Jane Hunter: “semantic weblets as opposed to the semantic web”
Nigel Shadbolt – has a large information base gathered from the computer science community using a CS ontology. Nice GUI with the British Isles overlaid with location points. Uses homegrown RDQL. Realtime mining of things like funding databases. “Scruffy works” – you don’t need to design everything cleanly upfront.
(ouch, gecko textbox is really slowing down on this long entry. webforms suck for text editing. i miss vim. just restarted epiphany, speed is fine again)
“Annotation bottleneck”, no “one size fits all” ontologies (obviously)
Guha – if people mark up manually, we’ll get a few hundred pages on the semantic web. The way forward is for people to export the knowledge in their databases in open ways (xml, rdf, whatever).
Question time:
audience: “blogs are simple, people are just monkey-like imitators when using technology. How can the semantic web be easy enough for people to participate in?”
eric miller: “i see blogging as being an example of a semantic weblet, being able to tie together trackbacks, foaf, etc”, semantic web is “building off a simple notion: directed labeled graphs … this made the web happen”
guha: “xml gives you a syntax. the world is about objects and relations between them. we got to RDF by taking KR and chopping it down to minimal size. the next step is to come up with enough applications, that demands more and more complexity.”
jim hendler: “It’s not semantics on the web, it’s the web of semantics”
audience question: “why does there have to be a backlash? how long did relational databases take to get established? the www?”
bijan: “it’s going too fast. the sciam article should be criticised for speeding things up. rapid information flow makes systems more vulnerable. fast capital flows are devastating on local communities. will people understand the implications of policy decisions made with regard to this technology? eg TIA systems”
amit: “every technology has good and bad, eg crypto”
jim: “things have to co-evolve: the google pagerank algorithm wouldn’t have worked in the early years of the web because of the nature and amount of linking. the altavista algorithm worked back then and lost relevance over time.”
guha: “people said the web wouldn’t work – lacking security, caching, etc. The architecture of the web allowed those things to be evolved later.”

Semantic web afternoon session

16.00 “SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation”, IBM
Talk will: propose an algorithm to run over the web to create automatic semantic tagging, present the output, describe the analysis platform
Using TAP knowledgebase
Processed 1 billion pages in a crawl; 250 million got TAP annotations (using about 50,000 TAP terms)
A cluster of 128 machines runs at 50,000 pages per second

Responses

  1. Ronny Lempel says:

    May 24th, 2003 at 3:26 pm (#)

    Actually, in “Predictive Caching and Prefetching of Search Results in Web Search Engines”, it was NOT claimed that “97.7% of queries just requested the first page of results”. That percentage was the number of users that requested results in batches of 10 results per page, as most Web users do by default. Those 10 results could have very well been the results ranking in places 11-20. Actually, the percentage of page views that are accounted for by the top pages is about 63.5%. By considering the almost 12% of page views for results 11-20, it can be estimated that about one of every five users requests more than one page of search results, and so about 80% of users view just the first page.

    BTW, the affiliation of both authors at the time of writing the paper was Technion, Israel (not IBM Research).

  2. redscouser says:

    January 10th, 2004 at 12:17 pm (#)

    Is Matt Biddulph, the gentleman from UK about 24?