After such an intense day of note taking and a very entertaining evening bowling with the guys from ILRT on day 1, I didn't manage to take proper notes on yesterday's sessions. I'll try to fill in some detail during today.
09.00 Extrapolation Methods for Accelerating PageRank Computations
The much-slashdotted algorithm for improving the sort of calculations Google does. The speaker notes that this algorithm by itself only gives a 30-50% improvement; it needs combining with other techniques. He also notes that this is not "accelerating Google", it's research work.
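To make the idea concrete, here is a rough sketch of power-iteration PageRank with a single extrapolation step. I didn't note which extrapolation the speaker presented, so the sketch uses Aitken's delta-squared formula as a stand-in; the function names, the damping factor of 0.85 and the toy graph are all my own assumptions, and it makes no claim about the size of any speed-up.

```python
def aitken(x0, x1, x2):
    """Componentwise Aitken delta-squared estimate from three successive iterates."""
    out = {}
    for v in x2:
        denom = x2[v] - 2.0 * x1[v] + x0[v]
        if abs(denom) < 1e-15:
            out[v] = x2[v]  # iterates already agree; keep the latest value
        else:
            out[v] = x0[v] - (x1[v] - x0[v]) ** 2 / denom
    return out


def pagerank(links, damping=0.85, tol=1e-10, extrapolate_at=10, max_iter=1000):
    """Power iteration over a dict {page: [pages it links to]}.

    Assumes every link target also appears as a key of `links`.
    Returns (ranks, number of iterations used).
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    history = [rank]
    for iteration in range(1, max_iter + 1):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links[v] or nodes  # a page with no links spreads its rank evenly
            share = damping * rank[v] / len(targets)
            for t in targets:
                new[t] += share
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            return rank, iteration
        history = (history + [rank])[-3:]  # keep the last three iterates
        if iteration == extrapolate_at and len(history) == 3:
            rank = aitken(*history)  # jump ahead, then carry on iterating
            history = [rank]
    return rank, max_iter


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
plain, plain_steps = pagerank(graph, extrapolate_at=None)
fast, fast_steps = pagerank(graph)
```

Because the damped iteration converges from any starting vector, the extrapolated run ends up at the same ranks as the plain one; the only thing the extrapolation can change is how many steps that takes.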
09.30 P2Cast: Peer-to-peer Patching Scheme for VoD Service
Using bittorrent-like peer to peer connections for distributing video content, but in a way that allows for streaming.
11.00 W3C Standards for Web Services update
Web Services Architecture, Web Services Description Language 1.2, SOAP 1.2, Internationalization and Web Services
Not much to say - just reports on activity so far, very uninspiring. Like being read company end-of-year reports.
14.00 Plenaries
"Leonardo's Laptop: Human Needs and the New Computing Technologies", largely content-free and hyperbolic
"Globalities, Spatialities and IT Strategies: New Web Politics and Social Development", largely incomprehensible. Looked like the slides had some interesting stats, although some of the graphs looked like a game of missile command.
16.00 Application Specific Data Replication for Edge Services
Interesting ideas on exploiting your particular application's usage patterns when distributing data to edge servers (akamai-style). For example, ad servers don't need the exact inventory of ads, they just need an ad that matches adequate criteria for display (if your inventory is large enough).
I'm at www2003 in Budapest, Hungary for the rest of the week. I'm not much of a note-taker but I'll probably jot a few things down here. Good wireless coverage in the conference venue. See also day 2.
09.00 Breakfast, registration, Hungarian Informatics and Communications minister, violinists with cheesy thumping synth backing track.
09.30 Tim Berners-Lee takes the stage. How are Web Services and Semantic Web related? A story of program and data as old as computing. Theme: "The data in our lives is a web", the RDF model and URIs as identifiers are descriptive, robust and reusable. Web Services can use these for discovery and description, RDF can be sent as a SOAP payload, etc. Emphasis on Enterprise Integration (euw). Ends with "OH MY GOD! THE ENTIRE WORLD IS ON MY LAPTOP!" really loud.
Information Retrieval
11.00 Information Retrieval session, starting with "Query-Free News Search" from Google.
(Bah, I find a powersocket and then the room loses power to its plugstrips. Puzzled people jiggle their cables all over the room.)
Generating related news by scraping TV closed-captions. "Issue query every s seconds" based on term extraction from recent text segment, data structure from previous text. Rare words given higher weight (Inverse Document Frequency squared). Use stemming, but not Porter algorithm or similar - just take first 5 chars of word. Build a stem vector over time, adding information if new text segment seems similar to old, building new vectors if the text segment is significantly different. Try to issue 3-word queries - 1-word too vague, fall back to 2-word if necessary.
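A minimal sketch of the query-building step as I understood it: stem by truncating to five characters, weight terms by term frequency times IDF squared, and keep the top three. The function names, the logarithmic IDF formula and the document-frequency numbers are my assumptions, not anything shown in the talk, and the stem-vector merging across segments is left out.

```python
import math
import re
from collections import Counter

STEM_LEN = 5  # "just take first 5 chars of word"


def stem(word):
    """Crude stemming: lowercase and keep only the first five characters."""
    return word.lower()[:STEM_LEN]


def build_query(segment, doc_freq, num_docs, max_terms=3):
    """Return up to max_terms stems from a text segment, ranked by tf * idf^2.

    doc_freq maps a stem to the number of documents containing it
    (a hypothetical background corpus of num_docs documents).
    """
    counts = Counter(stem(w) for w in re.findall(r"[A-Za-z]+", segment))
    weights = {}
    for s, tf in counts.items():
        # Clamp at zero so very common stems cannot gain weight from squaring.
        idf = max(0.0, math.log(num_docs / (1 + doc_freq.get(s, 0))))
        weights[s] = tf * idf ** 2
    ranked = sorted(weights, key=lambda s: (-weights[s], s))
    return ranked[:max_terms]


# Made-up document frequencies for illustration only.
doc_freq = {"the": 900, "presi": 40, "visit": 60, "budap": 3, "today": 500}
query = build_query("The president visits Budapest today", doc_freq, num_docs=1000)
```

If a segment yields fewer than three distinct stems, the function simply returns what it has, which is a loose stand-in for the fall-back to 2-word queries.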
Postprocessing has steps of boosting based on top terms and similarity to given text, then filter to top 2 results.
data sets: 3 days of 30 mins CNN Headline News, 90 minutes of other CNN programming
Search precision is improved by issuing queries based on longer (15 second) segments of text. Boosting and filtering steps make big (0.6 versus 0.9) improvement.
They define recall for the web (pooled recall) as (# of relevant pages an algorithm returns) / (total # of relevant pages found by any Google algorithm), because we can't know actual recall on the web. Introductory material on recall and precision.
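The pooled recall definition is easy to write down as code. This is only a sketch of the ratio described above; the function name and the page identifiers are made up.

```python
def pooled_recall(relevant_found, pool):
    """Relevant pages one algorithm found, divided by the relevant pages
    found by any algorithm in the pool (a stand-in for true recall)."""
    pooled = set().union(*pool)
    if not pooled:
        return 0.0
    return len(relevant_found & pooled) / len(pooled)


# Relevant pages returned by two hypothetical algorithms.
algo_a = {"p1", "p2"}
algo_b = {"p2", "p3", "p4"}
recall_a = pooled_recall(algo_a, [algo_a, algo_b])  # 2 of the 4 pooled pages
recall_b = pooled_recall(algo_b, [algo_a, algo_b])  # 3 of the 4 pooled pages
```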
If we're going to pick only top 2 results, discarding one top result when it is very similar to the other, then picking another top result, is very effective.
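Roughly what that filtering step could look like. The talk didn't say how similarity was measured, so this sketch assumes Jaccard overlap of word sets with an arbitrary 0.5 threshold; the names and sample headlines are mine.

```python
def jaccard(a, b):
    """Word-set overlap between two strings, from 0.0 (disjoint) to 1.0 (identical)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def pick_top(results, k=2, threshold=0.5):
    """Walk a best-first result list, skipping anything too similar to a
    result already chosen, until k results have been picked."""
    chosen = []
    for text in results:
        if all(jaccard(text, c) < threshold for c in chosen):
            chosen.append(text)
        if len(chosen) == k:
            break
    return chosen


ranked = [
    "budapest hosts web conference",
    "web conference hosts budapest visitors",
    "pagerank speedup announced",
]
top_two = pick_top(ranked)
```

Here the second headline shares four of five distinct words with the first, so it is dropped and the third takes its place.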
Open problems for future work: how to combine algorithms? Weighted term-vector retrieval? Extend to other genres, not just news. Apply to other text streams, eg conversations.
This is Google research, not deployed.
11.30 "Improving pseudo-relevance feedback in web information retrieval using webpage segmentation" presented on video. Microsoft Research Asia. SARS prevents attendance.
Separating out bits of webpages (nav, branding, content, multiple topics within content) for IR. (Non-native English speaker and noisy audio make understanding a bit of a struggle.) "Goal: construct a vision-based content structure for a web page". Previous DOM-based work "does not necessarily reflect semantic partition". VIPS algorithm tries to extract visual blocks and find separators: horiz/vert lines, space with no blocks crossing. (Video occasionally jerky; presenter appears as-if-by-magic pointing at slides from time to time, Mr Benn style.) This extraction to be done after an initial traditional IR search to iterate and improve results by going down to sub-page segment level to measure quality of documents from initial resultset.
Experiments used TREC wt10g corpus (non-free dataset).
12.00 "Predictive Caching and Prefetching of Search Results in Web Search Engines" IBM Research
Exploiting locality of reference in query streams faced by search engines: popularity of searches (zeitgeist), multiple result pages per query.
Based on stats extracted from query logs (7160190 queries to altavista in sept 2001). A query is a triplet (topic/searchphrase,firstpagerequested,lastpagerequested). 97.7% of queries just requested the first page of results. 7160190 queries requested 7549873 result pages. 67% of the 2657410 topics were only requested once; most popular topic requested 31546 times. (Presumably zipf curve.) Yes, follows power-law distribution.
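A quick back-of-the-envelope pass over those figures, using only the numbers quoted above (the variable names are mine):

```python
queries = 7_160_190
result_pages = 7_549_873
topics = 2_657_410
top_topic_requests = 31_546

pages_per_query = result_pages / queries          # average result pages per query
extra_pages = result_pages - queries              # pages requested beyond the first
queries_per_topic = queries / topics              # average repeats of a topic
single_request_topics = round(0.67 * topics)      # topics seen exactly once
top_topic_share = top_topic_requests / queries    # share taken by the top topic

print(f"{pages_per_query:.3f} pages per query, {extra_pages} follow-up pages")
print(f"{queries_per_topic:.2f} queries per topic on average")
print(f"{single_request_topics} topics requested once")
print(f"top topic is {top_topic_share:.2%} of all queries")
```

So the average query asks for barely more than one page, and even the most popular topic is under half a percent of traffic, which is what you'd expect from a long-tailed distribution.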
Caching schemes: LRU. Segmented LRU - protected and probationary segments with separate expiry. Pages requested twice get moved into protected segment. Previous work suggests SLRU outperforms LRU until the cache gets large, when there's not much difference between the two algos. PDC - probability-driven cache: simple model of a search session that helps pre-fetching by predicting user's next hit (most likely being the next page in the resultset for an existing topic query).
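For the record, a small sketch of how a segmented LRU might be put together, going only on the description above. The class and method names, the segment sizes, and the choice to demote evicted protected entries back into the probationary segment are my assumptions rather than anything from the paper; PDC is not covered.

```python
from collections import OrderedDict


class SegmentedLRU:
    """Two LRU segments: new entries start on probation and move to the
    protected segment when they are requested a second time."""

    def __init__(self, probation_size, protected_size):
        self.probation = OrderedDict()
        self.protected = OrderedDict()
        self.probation_size = probation_size
        self.protected_size = protected_size

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)  # mark as most recently used
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)
            self._promote(key, value)  # second request: protect it
            return value
        return None  # cache miss

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
        elif key in self.probation:
            self.probation[key] = value
        else:
            self._insert_probation(key, value)

    def _insert_probation(self, key, value):
        self.probation[key] = value
        while len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)  # evict least recently used

    def _promote(self, key, value):
        self.protected[key] = value
        while len(self.protected) > self.protected_size:
            # Demote the oldest protected entry instead of dropping it outright.
            old_key, old_value = self.protected.popitem(last=False)
            self._insert_probation(old_key, old_value)


cache = SegmentedLRU(probation_size=2, protected_size=1)
cache.put("a", "A")
cache.put("b", "B")
cache.get("a")       # second request for "a" moves it to the protected segment
cache.put("c", "C")
cache.put("d", "D")  # probation overflows, so "b" is evicted
```

The point of the split is that a burst of one-off queries can only churn the probationary segment; anything requested twice survives it.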
Phew, lunchtime.
Semantic Web panel
14:00 "The Semantic Web: Scientific American article considered harmful?" Jim Hendler, Amit Sheth, Jane Hunter, Nigel Shadbolt, Ramanathan Guha panel session
The article in question
Jim Hendler's slides - he's up first, gives a quick demo of running system showing the use case from the article. Now talking semweb advocacy.
This is fairly free-form and I'm squeezed into a corner - there's some scribing of the session being logged from #www2003.
Jane Hunter: "semantic weblets as opposed to the semantic web"
Nigel Shadbolt - has a large information base gathered from computer science community using a CS ontology. Nice GUI with british isles overlaid with location points. Uses homegrown RDQL. Realtime mining of things like funding databases. "scruffy works" - you don't need to design everything cleanly upfront.
(ouch, gecko textbox is really slowing down on this long entry. webforms suck for text editing. i miss vim. just restarted epiphany, speed is fine again)
"Annotation bottleneck", no "one size fits all" ontologies (obviously)
Guha - if people mark up manually, we'll get a few hundred pages on the semantic web. The way forward is for people to export the knowledge in their databases in open ways (xml, rdf, whatever).
Question time:
audience: "blogs are simple, people are just monkey-like imitators when using technology. How can the semantic web be easy enough for people to participate in?"
eric miller: "i see blogging as being an example of a semantic weblet, being able to tie together trackbacks, foaf, etc", semantic web is "building off a simple notion: directed labeled graphs ... this made the web happen"
guha: "xml gives you a syntax. the world is about objects and relations between them. we got to RDF by taking KR and chopping it down to minimal size. the next step is to come up with enough applications, that demands more and more complexity."
jim hendler: "It's not semantics on the web, it's the web of semantics"
audience question: "why does there have to be a backlash? how long did relational databases take to get established? the www?"
bijan: "it's going too fast. the sciam article should be criticised for speeding things up. rapid information flow makes systems more vulnerable. fast capital flows are devastating on local communities. will people understand the implications of policy decisions made with regard to this technology? eg TIA systems"
amit: "every technology has good and bad, eg crypto"
jim: "things have to co-evolve: the google pagerank algorithm wouldn't have worked in the early years of the web because of the nature and amount of linking. the altavista algorithm worked back then and lost relevance over time."
guha: "people said the web wouldn't work, lacking security, caching, etc. The architecture of the web allowed those things to be evolved later."
Semantic web afternoon session
16.00 "SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation", IBM
Talk will: propose an algorithm to run over the web to create automatic semantic tagging, present the output, describe the analysis platform
Using TAP knowledgebase
Processed 1 billion pages in a crawl, 250 million get TAP annotations (using about 50,000 TAP terms)
Cluster of 128 machines runs 50,000 pages per second
This morning I did my talk (A Semantic Web Shoebox - Annotating Photos with RSS and RDF) at XMLEurope 2003. The slides are now available.
The talk was scheduled amongst a fun-packed Semantic Web series of Celia Romaniuk (Soap Operas and the Semantic Web), Uche Ogbuji (Akara - Part Wiki, Part Blog, Powered by XML and RDF), Jo Walsh (Collaborative Mapping with RDF) and myself.