Fraunhofer IAIS, 2003.
At Fraunhofer, I worked on the DIASTASIS project, the grand, full title of which was:
Digital Era Statistical Indicators: Definition, measurement and exploitation of new socio-economic indicators by correlating Web usage statistical data and household research.
Essentially, the idea was that by carrying out a full demographic study of a small sample population, one could infer the same attributes for a much larger population on the basis of a handful of indicators, shared by both populations, that were already available. This work was carried out by the Spanish partners in the project.
My role was the classification of web browsing habits. I had a large dataset of URIs visited by study participants and had to devise a means of automatically classifying them into topics. For example, a visit to the BBC News website might be classified as ‘news’, and a visit to Real Madrid’s website as ‘sport’. Assigning such basic classifications is hugely problematic, but it is better than nothing.
The approach I took was supervised learning using support vector machines (SVMs). The initial training data was the DMOZ human-categorised list of websites, which gave us a mapping of topics to URIs. The next step was to download the content at each address, preprocess it and use it to train the classifiers. At this stage, it became clear that the actual machine-readable content of a page can be slight, as with image-heavy websites or those that don’t offer content until some levels down in the navigation. As a result, pages were crawled at a deeper level and the HTML markup was used to weight text features. Another issue is that the content of pages changes over time, so the content fetched at the time of crawling or classification may not match what the user saw.
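To illustrate what weighting text features by HTML markup can look like, here is a minimal sketch in Python using only the standard library. It is not the project's code: the tag weights, the tokenisation and the names `TAG_WEIGHTS`, `WeightedTextExtractor` and `weighted_features` are all assumptions made for the example. Each word is counted with the weight of the most heavily weighted tag enclosing it, and script and style content is skipped.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights: the values actually used in the project are not known.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0,
               "a": 1.5, "b": 1.5, "strong": 1.5}
SKIPPED_TAGS = {"script", "style"}  # not visible text, so never counted


class WeightedTextExtractor(HTMLParser):
    """Accumulates word counts, scaled by the enclosing HTML tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []        # stack of currently open tags
        self.features = Counter()  # word -> weighted count

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in SKIPPED_TAGS for tag in self.open_tags):
            return
        # Use the largest weight among the enclosing tags (default 1.0).
        weight = max((TAG_WEIGHTS.get(tag, 1.0) for tag in self.open_tags),
                     default=1.0)
        for token in re.findall(r"[a-z]+", data.lower()):
            self.features[token] += weight


def weighted_features(html):
    """Return a dict mapping each word in the page to its weighted count."""
    parser = WeightedTextExtractor()
    parser.feed(html)
    parser.close()
    return dict(parser.features)


page = ("<html><head><title>Football news</title>"
        "<script>var x = 1;</script></head>"
        "<body><h1>Match report</h1><p>The football match ended.</p></body></html>")
features = weighted_features(page)
```

In this example, a word that appears in the title contributes more than the same word in body text, so a page with little running text can still yield usable features. Vectors of this kind could then be passed to an SVM trainer, although the exact feature representation used in the project is not described here.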