This work from MSR describes the architecture of a product synthesizer that could form the back end of any product search engine. The back-end systems of these search engines have to extract and transform the product information contained in unstructured or semi-structured merchant feeds into a form that is consistent with the engine's own catalog schema. The structural diversity, churn, and scale of consumer products turn the maintenance of a product catalog into a unique challenge for these product search engines.
At a high level, the system solves the problem by synthesizing products for the catalog in two stages.
First, the attributes of a product as found in the merchant’s feed are mapped to the attributes of the same product in the catalog. This process, referred to as attribute correspondence creation, is a supervised learning technique: a classifier learns to map merchant attribute names to catalog attribute names from offers that were previously matched to catalog products. The interesting thing to note here is that the training set is created and maintained automatically by rules, with no manual labeling. This learning happens offline, asynchronously from feed processing.
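A minimal sketch of the idea behind this first stage, under loud assumptions: instead of the paper's trained classifier, it substitutes a simple value-agreement score computed over historical offer-to-product matches. The function names (`learn_correspondences`, `normalize`), the thresholds, and the data layout are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def normalize(value):
    """Crude value normalization so that 'Sony ' and 'sony' compare equal."""
    return str(value).strip().lower()


def learn_correspondences(matched_pairs, min_support=2, min_agreement=0.6):
    """Derive merchant-attribute -> catalog-attribute mappings.

    matched_pairs: iterable of (merchant_id, offer_attrs, product_attrs),
    where each offer is already known to describe the given catalog product.
    NOTE: this is a stand-in for the paper's classifier, not its method.
    """
    seen = defaultdict(int)    # how often the two attributes co-occur
    agree = defaultdict(int)   # how often their values also agree
    for merchant, offer, product in matched_pairs:
        for o_name, o_val in offer.items():
            for c_name, c_val in product.items():
                key = (merchant, o_name, c_name)
                seen[key] += 1
                if normalize(o_val) == normalize(c_val):
                    agree[key] += 1

    best = {}  # (merchant, offer_attr) -> (catalog_attr, score)
    for (merchant, o_name, c_name), n in seen.items():
        score = agree[(merchant, o_name, c_name)] / n
        if n < min_support or score < min_agreement:
            continue
        current = best.get((merchant, o_name))
        if current is None or score > current[1]:
            best[(merchant, o_name)] = (c_name, score)
    return {key: c_name for key, (c_name, _) in best.items()}


# Hypothetical history of offers already matched to catalog products.
history = [
    ("m1", {"Mfr": "Canon", "MP": "12"}, {"brand": "Canon", "megapixels": "12"}),
    ("m1", {"Mfr": "Nikon", "MP": "16"}, {"brand": "Nikon", "megapixels": "16"}),
    ("m1", {"Mfr": "Sony ", "MP": "12"}, {"brand": "sony", "megapixels": "12"}),
]
correspondences = learn_correspondences(history)
```

Because the evidence comes from matches that already exist, no hand-labeled data is needed, which is the property the summary highlights. A real system would feed several such signals into a classifier rather than thresholding a single score.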
Next, the structure of the incoming feed is reconciled with the catalog’s structure using the learned attribute correspondences. The reconciled offers are then clustered, and a single product is derived from each cluster.
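The second stage can be sketched roughly as follows. This is an illustration only: the choice to cluster on exact key attributes (here `brand` and `model`) and to fuse values by majority vote are assumptions made for the example, not details confirmed by the summary, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def reconcile(offer, mapping):
    """Rename merchant attributes to catalog attributes; drop unmapped ones."""
    return {mapping[name]: value for name, value in offer.items() if name in mapping}


def cluster_offers(offers, key_attrs):
    """Group reconciled offers that share the same normalized key attributes."""
    clusters = defaultdict(list)
    for offer in offers:
        key = tuple(str(offer.get(a, "")).strip().lower() for a in key_attrs)
        if all(key):  # skip offers that are missing any key attribute
            clusters[key].append(offer)
    return dict(clusters)


def fuse(cluster):
    """Build one product from a cluster by majority vote per attribute."""
    votes = defaultdict(Counter)
    for offer in cluster:
        for name, value in offer.items():
            votes[name][str(value).strip()] += 1
    return {name: counts.most_common(1)[0][0] for name, counts in votes.items()}


# Hypothetical attribute correspondences for one merchant.
mapping = {"Mfr": "brand", "Model No": "model", "MP": "megapixels"}
feed = [
    {"Mfr": "Canon", "Model No": "A100", "MP": "12", "Shipping": "free"},
    {"Mfr": "canon", "Model No": "A100", "MP": "12"},
    {"Mfr": "Canon", "Model No": "A100", "MP": "10"},
    {"Mfr": "Nikon", "Model No": "B7", "MP": "16"},
]
reconciled = [reconcile(offer, mapping) for offer in feed]
clusters = cluster_offers(reconciled, key_attrs=("brand", "model"))
products = {key: fuse(members) for key, members in clusters.items()}
```

Voting per attribute lets the synthesized product tolerate a single merchant's typo or outdated value, at the cost of trusting whichever value is most common in the cluster.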
A comprehensive product catalog is essential to the success of Product Search engines and shopping sites such as Yahoo! Shopping, Google Product Search, and Bing Shopping. Given the large number of products and the speed at which they are released to the market, keeping catalogs up-to-date becomes a challenging task, calling for the need of automated techniques. In this paper, we introduce the problem of product synthesis, a key component of catalog creation and maintenance. Given a set of offers advertised by merchants, the goal is to identify new products and add them to the catalog, together with their (structured) attributes. A fundamental challenge in product synthesis is the scale of the problem. A Product Search engine receives data from thousands of merchants about millions of products; the product taxonomy contains thousands of categories, where each category has a different schema; and merchants use representations for products that are different from the ones used in the catalog of the Product Search engine. We propose a system that provides an end-to-end solution to the product synthesis problem, and addresses issues involved in data extraction from offers, schema reconciliation, and data fusion. For the schema reconciliation component, we developed a novel and scalable technique for schema matching which leverages knowledge about previously-known instance-level associations between offers and products; and it is trained using automatically created training sets (no manually-labeled data is needed). We present an experimental evaluation using data from Bing Shopping for more than 800K offers, a thousand merchants, and 400 categories. The evaluation confirms that our approach is able to automatically generate a large number of accurate product specifications. Furthermore, the evaluation shows that our schema reconciliation component outperforms state-of-the-art schema matching techniques in terms of precision and recall.
Previewing from http://arxiv.org/pdf/1105.4251