Clint Heyer projects

Diastasis

Classifying web browsing interests.
Fraunhofer IAIS, 2003.

At Fraunhofer, I worked on the DIASTASIS project, the grand, full title of which was:

Digital Era Statistical Indicators: Definition, measurement and exploitation of new socio-economic indicators by correlating Web usage statistical data and household research.

Essentially, the idea was that by carrying out a full demographic study of a small sample population, one could correlate attributes to a much larger population on the basis of a handful of indicators (shared by both populations) which are already available. This work was carried out by the Spanish partners in the project.

My role was the classification of web browsing habits. I had a large dataset of URIs visited by study participants and had to devise some means of automatically classifying them into topics. For example, visiting the BBC News website might be classified as ‘news’, and visiting Real Madrid’s website as ‘sport’. Assigning such basic classifications is hugely problematic, but is better than nothing.

The approach I took was supervised learning using support vector machines (SVMs). The initial training data was the DMOZ human-categorised list of websites, which gave us a mapping of topics to URIs. The next step was to download the content at each address, preprocess it and use it to train the classifiers. At this stage, it became clear that the actual machine-readable content of a page can be slight, as with image-heavy websites or those that don’t offer content until some levels down in the navigation. As a result, pages were crawled to a deeper level and the HTML markup was used to weight text features. Another issue is that the content of pages changes over time, so the content fetched at the time of crawling or classification may not match what the user saw.
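
The markup weighting idea can be illustrated with a short sketch. This is not the project's original code: the tag weights, the tokenisation and the names `TAG_WEIGHTS`, `WeightedTextExtractor` and `weighted_features` are all assumptions made for illustration. It only shows how a term in a title or heading might count for more than the same term in body text before the resulting counts are handed to a classifier such as an SVM.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Hypothetical weights: the scheme actually used in the project is not recorded here.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "a": 1.5}
SKIPPED_TAGS = {"script", "style"}  # text inside these is never visible content


class WeightedTextExtractor(HTMLParser):
    """Collects term counts, scaling each term by the weight of its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []        # stack of currently open tags
        self.features = Counter()  # term -> weighted count

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags in between.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in SKIPPED_TAGS for tag in self.open_tags):
            return
        # Use the largest weight among the enclosing tags; plain text counts as 1.
        weight = max([TAG_WEIGHTS.get(tag, 1.0) for tag in self.open_tags], default=1.0)
        for term in re.findall(r"[a-z]+", data.lower()):
            self.features[term] += weight


def weighted_features(html):
    """Return a Counter of weighted term counts for one HTML document."""
    parser = WeightedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.features


page = (
    "<html><head><title>Football news</title></head>"
    "<body><h1>Football</h1><p>Match report and football scores</p>"
    "<script>var news = 1;</script></body></html>"
)
features = weighted_features(page)
print(features.most_common(3))
```

In a full pipeline, one such feature vector per crawled page, labelled with its DMOZ topic, would form the training set. Taking the largest enclosing weight rather than multiplying nested weights is a design choice made here to keep deeply nested markup from dominating.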