
Cristian Consonni

Ph.D. in Computer Science, free software activist, physicist and storyteller



These are my publications (complete Curriculum Vitæ: en, it). See also my Google Scholar profile and ResearchGate profile.

While there is extensive literature both on the motivations of Wikipedia's editors and on newcomers' retention, less is known about the process by which experienced editors leave. In this paper, we present an approach to characterize Wikipedia's editor drop-off as the transitional states from activity to inactivity. Our approach is based on the data that can be collected or inferred about editors' activity within the project, namely their contributions to encyclopedic articles, discussions with other editors, and overall participation. Along with the characterization, we want to advance three main hypotheses, derived from the state of the art in the literature and the documentation produced by the community, to understand which interaction patterns may anticipate editors leaving Wikipedia: 1) abrupt interactions or conflict with other editors, 2) excess in the number and spread of interactions, and 3) a lack of interactions with editors with similar characteristics. We present this work both as a preliminary stage of our research to understand editor drop-off and as a flexible frame to look at the phenomenon that we believe can be useful in the future. Furthermore, by characterizing drop-off and identifying interaction patterns that may be associated with it, it may be possible to assess the general health of a community, and ultimately propose changes to improve it.
Mobility and transport, by their nature, involve crowds and require the coordination of multiple stakeholders - such as policy-makers, planners, transport operators, and the travelers themselves. However, traditional approaches have been focused on time savings, proposing to users solutions that include the shortest or fastest paths. We argue that this approach towards travel time value is not centered on a traveler's perspective. To date, very few works have mined data from crowds of travelers to test the efficacy and efficiency of novel mobility paradigms. In this paper, we build upon a different paradigm of worthwhile time in which travelers can use their travel time for other activities; we present a new dataset, which contains data about travelers and their journeys, collected from a dedicated mobile application. Each trip contains multi-faceted information: from the transport mode, through its evaluation, to the positive and negative experience factors. To showcase this new dataset's potential, we also present a use case, which compares corresponding trip legs with different transport modes, studying experience factors that negatively impact users using cycling and public transport as alternatives to cars. We conclude by discussing other application domains and research opportunities enabled by the dataset.
Surfing the links between Wikipedia articles constitutes a valuable way to acquire new knowledge related to a topic by exploring its connections to other pages. In this sense, Personalized PageRank is a well-known option to make sense of the graph of links between pages, and identify the most relevant articles with respect to a given one; its performance, however, is hindered by pages with high indegree that function as hubs and obtain high scores regardless of the starting point. In this work, we present CycleRank, a novel algorithm based on cyclic paths aimed at finding the most relevant nodes related to a topic. To compare the results of CycleRank with those of Personalized PageRank and other algorithms derived from it, we perform three experiments based on different ground truths. We find that CycleRank aligns better with readers’ behavior as it ranks in higher positions the articles corresponding to links that receive more clicks; it tends to identify in higher position related articles highlighted by editors in the “See also” section; and it is more robust to global hubs of the network having high indegree. Finally, we show that computing CycleRank is two orders of magnitude faster than computing the other baselines.
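
The idea of scoring nodes by the cycles they share with a reference node can be illustrated with a minimal sketch. This is not the published CycleRank implementation: the function name `cycle_scores`, the maximum cycle length `max_len`, the exponential weight, and the toy graph are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cycle_scores(graph, ref, max_len=3, weight=lambda n: math.exp(-n)):
    """Score nodes by the simple cycles of length 2..max_len through `ref`.

    `graph` maps each node to the list of nodes it links to.
    Each cycle of length n adds weight(n) to every node on it
    (the exponential weight is an assumption of this sketch).
    """
    scores = defaultdict(float)

    def extend(node, path):
        for nxt in graph.get(node, ()):
            if nxt == ref:
                # The path closes into a cycle with len(path) edges;
                # self-loops (length 1) are ignored.
                if len(path) >= 2:
                    w = weight(len(path))
                    for v in path:
                        scores[v] += w
            elif nxt not in path and len(path) < max_len:
                extend(nxt, path + [nxt])

    extend(ref, [ref])
    return dict(scores)


# Toy graph: "Hub" has high indegree but never links back to "A".
graph = {
    "A": ["B", "Hub"],
    "B": ["A", "C", "Hub"],
    "C": ["A", "Hub"],
    "Hub": [],
}
scores = cycle_scores(graph, "A")
```

In this toy graph the hub receives no score at all, because no cycle through the reference node passes through it, while `B` (on two cycles) scores higher than `C` (on one).
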
Understanding user mobility is central to developing better transport systems that answer users' needs. Users usually plan their travel according to their needs and preferences; however, different factors can influence their choices when traveling. In this work, we model users' preferences and match them with their actual transport use. We use data coming from a mobility platform developed for mobile devices, whose aim is to understand the value of users' travel time. Our first goal is to characterize the perception that users have of their mobility by analyzing their general preferences expressed **before** their travel time. Our approach combines dimensionality reduction and clustering techniques to provide interpretable profiles of users. Then, we perform the same task **after** monitoring users' travels by matching users' preferences with their actual behavior. Our results show that there are substantial differences between users' perception of their mobility and their actual behavior: users overestimate their preferences for specific mobility modes, which in general yield a lower return in terms of the worthwhileness of their trip.
Surfing the links between Wikipedia articles constitutes a valuable way to acquire new knowledge related to a topic. The density of connections in Wikipedia is such that, starting from a single page, it is possible to reach virtually any other topic in the encyclopedia. This abundance highlights the need for dedicated algorithms to identify the topics which are more relevant to a given concept. In this context, a well-known algorithm is Personalized PageRank; its performance, however, is hindered by pages with high in-degree that function as hubs and appear with high scores regardless of the starting point. In this work, we present how a novel algorithm based on cyclic paths can be used to find the most relevant nodes in the Wikipedia link network related to a topic. We present a case study showing how the most relevant concepts associated with the topic of "Fake news" vary over time and across language editions.
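
The hub effect mentioned for Personalized PageRank can be reproduced on a toy graph with a plain power iteration. This is a minimal sketch, not the setup used in the paper: the function name, the damping factor `alpha=0.85`, the handling of dangling nodes, and the four-node graph are assumptions for illustration.

```python
def personalized_pagerank(graph, ref, alpha=0.85, iters=100):
    """Power iteration for Personalized PageRank with teleport to `ref`.

    `graph` maps each node to the list of nodes it links to.
    Nodes without outgoing links send all their mass back to `ref`
    (one common convention, assumed here).
    """
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    rank = {n: 1.0 if n == ref else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n in nodes:
            out = graph.get(n, [])
            if out:
                share = alpha * rank[n] / len(out)
                for v in out:
                    nxt[v] += share
                nxt[ref] += (1 - alpha) * rank[n]
            else:
                nxt[ref] += rank[n]
        rank = nxt
    return rank


# Toy graph: "Hub" is linked from everywhere but never links back
# to the reference page.
graph = {
    "Fake news": ["Misinformation", "Hub"],
    "Misinformation": ["Fake news", "Hub"],
    "Hub": ["Other"],
    "Other": ["Hub"],
}
rank = personalized_pagerank(graph, "Fake news")
```

Here `Hub` ends up with a higher score than `Misinformation`, even though only the latter links back to the reference page.
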
Wikipedia articles contain multiple links connecting a subject to other pages of the encyclopedia. In Wikipedia parlance, these links are called internal links or wikilinks. We present a complete dataset of the network of internal Wikipedia links for the 9 largest language editions. The dataset contains yearly snapshots of the network and spans 17 years, from the creation of Wikipedia in 2001 to March 1st, 2018. While previous work has mostly focused on the complete hyperlink graph, which also includes links automatically generated by templates, we parsed each revision of each article to track links appearing in the main text. In this way we obtained a cleaner network, discarding more than half of the links and representing all and only the links intentionally added by editors. We describe in detail how the Wikipedia dumps have been processed and the challenges we have encountered, including the need to handle special pages such as redirects, i.e., alternative article titles. We present descriptive statistics of several snapshots of this network. Finally, we propose several research opportunities that can be explored using this new dataset.
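
As a rough illustration of what tracking links in the main text involves, the sketch below pulls wikilink targets out of a wikitext string with a regular expression. It is a simplified assumption-laden example, not the pipeline described in the paper: the pattern, the function name `extract_wikilinks`, and the sample text are made up here, and real dumps also require handling redirects, nested markup, comments, and `nowiki` sections.

```python
import re

# [[Target]], [[Target|label]] or [[Target#Section|label]].
# Template calls such as {{Infobox ...}} are not matched, so links
# generated by templates are left out. File: and Category: links
# would still match and need separate filtering.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def extract_wikilinks(wikitext):
    """Return the target titles of the wikilinks written in `wikitext`."""
    return [m.group(1).strip() for m in WIKILINK.finditer(wikitext)]


sample = (
    "The [[Moon]] orbits [[Earth|our planet]]; "
    "see [[Tide#Causes|tides]]. {{Infobox planet}}"
)
links = extract_wikilinks(sample)
```
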

A relevant task in the exploration and understanding of large datasets is the discovery of hidden relationships in the data. In particular, functional dependencies have received considerable attention in the past. However, there are other kinds of relationships that are significant both for understanding the data and for performing query optimization. Order dependencies belong to this category. An order dependency states that if a table is ordered on a list of attributes, then it is also ordered on another list of attributes. The discovery of order dependencies has been only recently studied. In this paper, we propose a novel approach for discovering order dependencies in a given dataset. Our approach leverages the observation that discovering order dependencies can be guided by the discovery of a more specific form of dependencies called order compatibility dependencies. We show that our algorithm outperforms existing approaches on real datasets. Furthermore, our algorithm can be parallelized leading to further improvements when it is executed on multiple threads. We present several experiments that illustrate the effectiveness and efficiency of our proposal and discuss our findings.
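
The definition of an order dependency given above can be checked naively on a small table. The sketch below only validates one candidate dependency; it is not the discovery algorithm proposed in the paper, and the function name, the column names, and the sample rows are hypothetical.

```python
def holds_order_dependency(rows, lhs, rhs):
    """Check whether ordering `rows` on `lhs` also orders them on `rhs`.

    `rows` is a list of dicts; `lhs` and `rhs` are lists of column names,
    compared lexicographically.
    """
    def key(row, attrs):
        return tuple(row[a] for a in attrs)

    ordered = sorted(rows, key=lambda r: key(r, lhs))
    for prev, curr in zip(ordered, ordered[1:]):
        # rhs must never decrease along the lhs ordering...
        if key(prev, rhs) > key(curr, rhs):
            return False
        # ...and rows tied on lhs must also be tied on rhs, otherwise
        # some valid lhs ordering would leave rhs unsorted.
        if key(prev, lhs) == key(curr, lhs) and key(prev, rhs) != key(curr, rhs):
            return False
    return True


rows = [
    {"salary": 3000, "tax": 200},
    {"salary": 1000, "tax": 100},
    {"salary": 2000, "tax": 200},
]
salary_orders_tax = holds_order_dependency(rows, ["salary"], ["tax"])
tax_orders_salary = holds_order_dependency(rows, ["tax"], ["salary"])
```

In this sample, ordering by `salary` also orders by `tax`, but not the other way round, because two different salaries share the same tax value.
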

Modern cities use information and communication technologies to obtain deep insights into the different aspects of the way they operate, which can allow officials to make informed decisions to improve operational efficiency and the life of their citizens. Analyzing the data about the different activities poses significant challenges. The difficulty lies not merely in the volume, which recent hardware and software advancements have helped to handle, but also in the variety, velocity, and veracity of the data. All this is often known as the Big Data paradigm. In this document, we analyze some of these challenges, which we believe have not yet received considerable attention, we explain their value, and we describe some of the advanced solutions we have developed.

These works have been developed while I was at Università degli studi di Milano-Bicocca, at Fondazione Bruno Kessler or during projects done by Wikimedia Italia.

Floods and, more generally, natural hazards cannot be prevented; however, measures can be taken to mitigate their impacts and prevent them from becoming disasters. Disaster management has been defined as «(the) continuous process that aims at avoiding or reducing the impact of natural hazards» (Poser, Dransch, 2010). Poser and Dransch (2010) have also outlined the importance of using up-to-date and accurate information in all phases of disaster management, as well as the need to integrate information from many different sources, including in-situ sensors, aerial and satellite images, and administrative, statistical, and socioeconomic census data. New Internet technologies have facilitated fast and easy data collection from the public, giving rise to the idea of using Volunteered Geographic Information (VGI) in disaster risk management (DRM). The paper discusses the opportunities and challenges of using VGI for disaster management, with particular focus on information for the prevention phase. This case study is based on flood risk assessment in two recently flooded cities in Veneto, Italy. We used InaSAFE, a free hazard and risk modeling application integrated in QGIS as a plug-in. InaSAFE offers the capacity to compare official hazard and exposure data with community crowdsourced data. In the case study we compare the results obtained by InaSAFE when using as input the data describing buildings (as exposure layer) drawn from OpenStreetMap (OSM) and from official public data. The goal of this work is to answer the following question: can OSM be used to collect exposure data for DRM? The paper ends by analyzing the opportunities and limits of the different data sources.

We study the topological charge distribution of the SU(3) Yang-Mills theory with high precision in order to be able to detect deviations from Gaussianity. The computation is carried out on the lattice with high statistics Monte Carlo simulations by implementing a naive discretization of the topological charge evolved with the Yang-Mills gradient flow. This definition is far less demanding than the one suggested by Neuberger’s fermions and, as shown in this paper, in the continuum limit its cumulants coincide with those of the universal definition appearing in the chiral Ward identities. Thanks to the range of lattice volumes and spacings considered, we can extrapolate the results for the second and fourth cumulant of the topological charge distribution to the continuum limit with confidence by keeping finite volume effects negligible with respect to the statistical errors. Our best result for the topological susceptibility is $t_{0}^{2}\chi=6.67(7) \times 10^{-4}$, where $t_0$ is a standard reference scale, while for the ratio of the fourth cumulant over the second, we obtain $R=0.233(45)$. The latter is compatible with the expectations from the large $N_c$ expansion, while it rules out the $\theta$ behavior of the vacuum energy predicted by the dilute instanton model. Its large distance from 1 implies that, in the ensemble of gauge configurations that dominate the path integral, the fluctuations of the topological charge are of quantum nonperturbative nature.
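
For reference, the two quantities quoted in this abstract can be written as follows, assuming the standard definitions in terms of the topological charge $Q$ and the space-time volume $V$ (these symbols are introduced here and do not appear in the abstract), with odd moments of $Q$ vanishing by symmetry:

$$
\chi = \frac{\langle Q^{2} \rangle}{V},
\qquad
R = \frac{\langle Q^{4} \rangle_{c}}{\langle Q^{2} \rangle}
  = \frac{\langle Q^{4} \rangle - 3\,\langle Q^{2} \rangle^{2}}{\langle Q^{2} \rangle}
$$

With these definitions a Gaussian distribution has a vanishing fourth cumulant and therefore $R=0$, so a nonzero $R$ measures the deviation from Gaussianity, while the value 1 is the one associated with the dilute instanton model.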

Open source initiatives are emerging tools among educational and cultural institutions. Museums, libraries, and archives widely use this practice to promote knowledge through a sharing process. Involving people in this participatory process also means increasing the number of real visitors in museums. That idea stimulated the project “Archeowiki”.

We present a precise computation of the topological charge distribution in the $SU(3)$ Yang-Mills theory. It is carried out on the lattice with high statistics Monte Carlo simulations by employing the clover discretization of the field strength tensor combined with the Yang–Mills gradient flow. The flow equations are integrated numerically by a fourth-order structure-preserving Runge–Kutta method. We have performed simulations at four lattice spacings and several lattice sizes to remove with confidence the systematic errors in the second (topological susceptibility $\chi_{t}^{\text{YM}}$ ) and the fourth cumulant of the distribution. In the continuum we obtain the preliminary results $t_{0}^{2} \chi_{t}^{\text{YM}}= 6.53(8) \times 10^{-4}$ and the ratio between the fourth and the second cumulant $R=0.233(45)$. Our results disfavour the $\theta$-behaviour of the vacuum energy predicted by dilute instanton models, while they are compatible with the expectation from the large-$N_{c}$ expansion.

  • Poster
    Cristian Consonni.
    “Nuts4Nuts: extraction of geospatial information from Wikipedia and linking to OpenStreetMap”
    Collective Intelligence 2014 @ Massachusetts Institute of Technology, Cambridge (MA) – June 10-12, 2014.
    poster (self-hosted) code on GitHub
Volunteered geographical information (VGI) is one facet of the phenomenon of crowdsourcing, in which people collect and share large amounts of data in open and collaborative projects. Although these projects have different purposes and scopes, there is some overlap between them, so it can be asked whether these data, which are collected by different communities with different processes, are coherent. In this context we have developed a tool, called Nuts4Nuts, which can identify the municipality in which the subject of a Wikipedia article is located by extracting relevant information from the templates or by performing an analysis of the article's incipit. The code is available under the permissive MIT license. At the moment, the system is limited to locations in Italy and is based on Italian Wikipedia.

Volunteered geographical information (VGI) is one facet of the phenomenon of crowdsourcing, in which people collect and share large amounts of data in open and collaborative projects. Although these projects have different purposes and scopes, there is some overlap between them, so it can be asked whether this data, which is collected by different communities with different processes, is coherent. I will discuss a set of possible analyses between OSM and Wikipedia data, how they can be performed, and a path for further research. I will also present some preliminary results of the application of these metrics regarding Italian Wikipedia and OSM in Italy for a given category of objects (churches and historical buildings).

Bachelor and Master Thesis

I have received both my BSc in Physics and my MSc in computational physics from the University of Milano-Bicocca (UniMiB), Milan.