Journal of Environmental Accounting and Management
Variable Selection Strategy Using Random Forests Algorithm to Identify the Effects of Environmental Factors on Health, Modeling from a GIS Multidimensional Dataset
Journal of Environmental Accounting and Management 3(2) (2015) 89--108 | DOI:10.5890/JEAM.2015.06.002
Stéphane Bourrelly
Aix Marseille Université. UMR-7300-ESPACE (CNRS); 98 Bd Edouard Herriot 06204 Nice cedex 3, France
Abstract
This paper proposes MyVsurfGeo (MVG), a method designed to assess the adverse effects of living environments on health, a key concern of France’s cancer plans. Increased access to numerous databases makes it possible to model physicochemical, sanitary and socio-economic features at territorial scale using Geographic Information Systems (GIS). However, GIS are poorly suited to characterizing relationships in high-dimensional datasets, especially when these combine qualitative and quantitative indicators with different levels of accuracy. A recent variable selection strategy based on Random Forests makes it possible to overcome this drawback, and the MVG method transposes this strategy into spatial analysis. It is applied to secondary tumors (TUM2) that developed during remission from childhood leukemia. The results highlight health determinants and contributory factors that explain TUM2 incidence. The significance of MVG and its findings is discussed, and the expected medical and political contributions are described.
Acknowledgments
I would like to thank Prof. C. Voiron for his help in designing the weighting system based on fuzzy set theory; Prof. P. Auquier for funding this research and helping me avoid many mistakes; Associate Professor R. Genuer for his technical support during the programming of the MVG algorithm; and my friend T. Corazao for his thorough language review.