Hugues Annoye (UCLouvain Saint-Louis)

April 3 @ 13:00 - 14:00

Title: BEAMM project: How do we deal with data? Statistical matching and WGAN generation.

Abstract:

In the framework of the BEAMM project (BElgian Arithmetic Micro-simulation Model), we propose several methods to address data issues. The core of this project is to develop a tax-benefit microsimulation model for Belgium that is accessible online, which requires intensive data handling. Our challenges are to create a unified data set containing variables from different surveys and to develop a fully synthetic database for the online development of the BEAMM platform.

In the BEAMM context, we use a large number of variables available in different databases. We thus need to analyze data from different sources; the observations, which share only a subset of the variables, cannot always be paired to detect common individuals. This is the case, for example, when the information required to study a certain phenomenon comes from different sample surveys. Statistical matching is a common practice for combining such data sets. In this talk, we investigate three methods and extend them to statistical matching. They are based on Kernel Canonical Correlation Analysis (KCCA; [2]), Super-Organizing Map (Super-OM; [1]) and Autoencoders-Canonical Correlation Analysis (A-CCA; [3]). These methods are designed to deal with various variable types, sampling weights and incompatibilities among categorical variables.

In our context, data privacy and anonymization are important. Under these circumstances, there is a growing need for synthetic databases that replicate the characteristics of the population while preserving privacy. In this presentation, we also investigate how a range of data generation approaches, drawing on various advancements in the Wasserstein Generative Adversarial Network (WGAN) literature, can be used to create survey databases. WGANs were introduced by Arjovsky et al. (2017) [4] in the context of image synthesis. Our algorithms have been adjusted to account for sampling weights. Moreover, survey and administrative data mix continuous and categorical variables, which should be taken into account in the architecture of the WGANs.

References:

[1] Kohonen, T. (1982), Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43 (1), 59–69.

[2] Lai, P. L. and Fyfe, C. (2000), Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10 (05), 365–377.

[3] Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986), Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge: MIT Press, 318–362.

[4] Arjovsky, M., Chintala, S. and Bottou, L. (2017), Wasserstein generative adversarial networks. In International Conference on Machine Learning. PMLR, 214–223.

The seminar will take place in Room S08 at the Faculty of Sciences.
