The authors have declared that no competing interests exist.

As a consequence of accelerated globalization, major cities all over the world are today characterized by increasing multiculturalism. The integration of immigrant communities may be affected by social polarization and spatial segregation. How are these dynamics evolving over time? To what extent are the different policies launched to tackle these problems working? These are critical questions traditionally addressed by studies based on surveys and census data. Such sources are reliable and largely free of spurious biases, but collecting the data is intensive and rather expensive work. Here, we conduct a comprehensive study on immigrant integration in 53 world cities by introducing an innovative approach: an analysis of the spatio-temporal communication patterns of immigrant and local communities based on language detection in Twitter and on novel metrics of spatial integration. We quantify the

Immigrant integration is a complex process involving a multitude of aspects such as religion, language, education, employment, accommodation, legal recognition and many others. Its study has a long tradition in sociology through concepts such as immigrant assimilation [

In global terms while international migration flows have remained almost stable over the last 20 years [

In parallel, the last few years have brought a paradigm shift in the context of socio-technical data. Human interactions are being digitally traced, recorded and analyzed at large scale. Sources as varied as mobile phone records [

In this work, we introduce a novel approach to quantify the spatial integration of immigrant communities in urban areas worldwide. By analyzing language in Twitter data, we assign languages to each user, paying special attention to those corresponding to migrant communities in the city considered. The individuals’ digital spatio-temporal communication patterns also allow us to define areas of residence. With this information, we perform a spatial distribution analysis through a modified entropy metric, as a quantitative way to measure the spatial integration of each community. The metric can be expressed as a bipartite network with the cultures of origin on one side and the hosting cities, countries and languages on the other. These results lead us to categorize the cities according to how well they integrate immigrant communities and also to quantify how well hosting countries integrate people from other cultures.

We selected 53 of the most populated cities in the world (see

The cities included in our analysis are mostly distributed over four continents (a). Africa has not been considered due to the lack of data. We cover each city with a square grid in order to keep a homogeneous spatial division over the whole urban area where the users are going to be distributed (b), selecting resident users and their most frequent location from their activity over space and time (c). In addition, we assign the users’ most probable native language (d) and perform a spatial analysis over the cities (e) to obtain the population distribution as a function of the language spoken by the users.

We will propose below a metric to assess spatial segregation of immigrant communities that is not highly sensitive to the specific borders of the area studied. However, this robustness has its limits. The mix of local and immigrant population is different in urban and rural areas. It is thus important to attain a balance and ensure that the region considered contains the city, where the signal on immigrants is stronger, but does not extend unnecessarily far from it. This means that we should agree on a city definition that can be applied around the world and that is large enough to include the whole metropolitan area. Unfortunately, generic definitions such as the Larger Urban Zone (LUZ) definition of Eurostat for Europe do not exist at the global scale. There are plenty of different ways of defining cities, with, for example, methods based on urban growth, percolation, attraction or fractal theory. All these methods require third-party data such as population, built-up area or flows of commuters that are not easily available in a consistent form everywhere. To sidestep this difficulty, we use a very pragmatic definition based only on the Euclidean distance and consider all activity within a frame of 60 × 60 km^{2} centered on the barycenters listed in Table B of the
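As an illustration of the frame definition above, the following minimal sketch maps a geolocated point to a cell of a square grid laid over a 60 × 60 km^{2} frame. The function name `cell_index`, the 0.5 km default cell side and the flat-Earth (equirectangular) conversion from degrees to kilometers are assumptions made for this example, not details taken from our pipeline.

```python
import math

FRAME_KM = 60.0  # side of the square frame around the city barycenter
CELL_KM = 0.5    # cell side; an assumed default for this sketch


def cell_index(lat, lon, center_lat, center_lon,
               frame_km=FRAME_KM, cell_km=CELL_KM):
    """Return the (row, col) of the grid cell containing a point,
    or None if the point lies outside the frame.

    Degrees are converted to kilometers with an equirectangular
    approximation, which is an assumption of this sketch.
    """
    km_per_deg_lat = 111.32
    km_per_deg_lon = 111.32 * math.cos(math.radians(center_lat))
    dx = (lon - center_lon) * km_per_deg_lon  # east-west offset in km
    dy = (lat - center_lat) * km_per_deg_lat  # north-south offset in km
    half = frame_km / 2.0
    if not (-half <= dx < half and -half <= dy < half):
        return None
    col = int((dx + half) // cell_km)
    row = int((dy + half) // cell_km)
    return row, col
```

With these defaults the frame contains 120 × 120 cells, and a point located exactly at the barycenter falls in cell (60, 60).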

As represented in

Users who are active within a given city for at least three consecutive months are considered to be residents, so this establishes the first condition
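The first residence condition can be sketched as follows. This is a hypothetical illustration only: the input format (tuples of user, city, year and month) and the function names are assumptions, and any further conditions applied in the analysis are not reproduced here.

```python
from collections import defaultdict


def longest_run(months):
    """Length of the longest run of consecutive month indices in a set."""
    best = 0
    for m in months:
        if m - 1 not in months:  # m starts a new run
            length = 1
            while m + length in months:
                length += 1
            best = max(best, length)
    return best


def resident_users(tweets, min_months=3):
    """Map each city to the users active there in at least
    `min_months` consecutive calendar months.

    `tweets` is an iterable of (user_id, city, year, month) tuples,
    an assumed input format for this sketch.
    """
    active = defaultdict(set)
    for user, city, year, month in tweets:
        # Encode (year, month) as one integer so that December and
        # the following January are consecutive.
        active[(user, city)].add(year * 12 + month - 1)
    residents = defaultdict(set)
    for (user, city), months in active.items():
        if longest_run(months) >= min_months:
            residents[city].add(user)
    return dict(residents)
```

Encoding months as a single integer keeps runs that cross a year boundary intact, which matters for datasets spanning several years.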

At this point, we introduce a method to determine which languages each user speaks, or at least in which languages he/she tweets. If any of these languages is characteristic of an immigrant community, this will most likely identify the user as a member of that community. To do this, the language of each tweet is detected using version 2.0 of the

As can be expected, there are users tweeting in more than one language. We create a dictionary of the occurrences of each language in each user’s tweets. English is one of the most frequent languages per user, because of its diffusion as
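A minimal sketch of such a dictionary of language occurrences is given below. The input format and function names are hypothetical, and the rule shown for picking a single language (the most frequent one) is only an assumption for illustration; in particular, it does not reproduce any special treatment of English.

```python
from collections import Counter, defaultdict


def language_counts(tweets):
    """Count, for every user, how many tweets were detected in each language.

    `tweets` is an iterable of (user_id, language_code) pairs,
    an assumed input format for this sketch.
    """
    counts = defaultdict(Counter)
    for user, lang in tweets:
        counts[user][lang] += 1
    return counts


def most_frequent_language(counter):
    """Return the language with the most tweets (ties are broken by
    first appearance). This selection rule is an assumption."""
    return counter.most_common(1)[0][0]
```

For example, a user with two tweets detected as Spanish and one as English would be associated with Spanish under this simplified rule.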

To quantify the spatial segregation of each immigrant community in every city, we build a bipartite spatial integration network _{l,c}, corresponds to the level of spatial integration measured with a new metric inspired by the Shannon entropy, but modified to take into account the finite character of the sampling of communities in our Twitter database. Shannon entropy-like descriptors have been used before in this context especially when considering the spatial segregation of ethnic minorities in the US cities [_{l,i}. This allows us to define an entropy per language community ^{2} is the area of the cells, it is added to make the entropy stable against changes of spatial scale as proposed in Ref. [^{2} cells and, thus, a change in cell size as those shown in the Supporting Information for 1 × 1 and 2 × 2 square kilometers requires a correction factor 4 and 16, respectively, as expressed in _{l,c} by itself is not telling us anything about characteristic features of the community _{l,c} users associated to language _{l,c} are symptoms of segregation, whereas local languages and those distributed spatially in a similar manner are characterized by _{l,c} values close to unit. The values of this normalized ratio _{l,c} constitute the weights of the links in the bipartite network displayed in

The network comprises two sets: _{l,c}. The size of the nodes is proportional to their degree and the color to their mean strength.

The stability of the spatial entropy as a function of different cell sizes (different scales Δ^{2}) is studied in the Supporting Information. We evaluate the relative error among the links of the bipartite network as a function of Δ

Twitter has the advantage of being a global source of data, but also the disadvantage of having several uncontrollable biases. Young people are usually over-represented [

Going step by step, let us first consider the influence of the geographical area chosen on the structure of the bipartite network between language communities and cities, paying special attention to the weights of its links. For this, recall that we have selected areas of 60 × 60 km^{2} around the barycenter of the 53 cities considered. These areas have been further divided into cells of 500 × 500 m^{2}, which are the basic units of the analysis. The 53 cities are large megalopolises, yet one can wonder whether a square frame of 60 km side is enough to cover all of them, or whether we are including rural areas that could pollute the results. To check the stability of the network as a function of the size of the city boundaries, we evaluate the relative error among the edge weights for different side sizes (20, 40, 80 and 100 km) using as reference the original 60 × 60 km^{2} frame. In particular, the relative change _{l,c} of the link weights in the bipartite spatial integration network taking as reference the 60 km side frame is computed as follows,
^{2} frame. Box plots displaying the distribution _{l,c} values for different frame side sizes can be found in

(a) Box plots of the relative change _{l,c} of the link weights in the bipartite spatial integration network taking as reference the 60 km side frame. (b) The entropy ratio _{l,c} for three examples of communities with more than 1000 detected users (Spanish in Chicago and Miami and Portuguese in Madrid). A random sub-sample is extracted and the calculated ratio of entropies is displayed as a function of the sample size. (c) The ratio of entropies _{l,c} as a function of the community size in number of users with a valid residence for all the communities. Every point represents a linguistic community in a city. The red vertical line marks the level of 30 users taken as a threshold. The inset shows a zoom-in with the details of the main plot. (d) We present the results concerning the ratio of entropies of a null model in which users belonging to an immigrant community are allowed to reside only in a subset _{0} of cells. These users are distributed randomly in the _{0} cells, while the local population is randomly distributed across all the grid cells. In the numerical examples, the system contains 100 × 100 = 10000 cells. The figure shows how the ratio of entropies changes with the number of users in the immigrant community and how the curves depend to first order on the ratio between the number of users and _{0}.

The next question concerns the minimum number of users needed to obtain a stable measure of _{l,c}. The number of users for whom we can detect a residence area per community is not very high (Table F in the _{l,c}, we select some of the most populous migrant communities, delete a fraction of their users at random and plot in _{l,c} as a function of the remaining users. Every random extraction produces a different value of _{l,c}, so in the plot we depict the average and the error bars obtained from the standard deviation. In addition, we mark with shadowed areas the values between which _{l,c} lies for the extractions with the largest number of users. The results depend on the particular community, but in general the values of _{l,c} enter the shadowed areas between 10 and 100 users; 30 corresponds to the middle ground on a logarithmic scale. A more systematic check can be seen in _{l,c} for language-city pairs is depicted as a function of the number of users associated to the particular community. After 30 users, there is no longer a clear dependency between _{l,c} and the number of users, so the metric must reflect the spatial distribution of the communities. It is also possible to perform a more detailed check in a controlled environment by introducing a null model in which the local population is randomly but uniformly distributed across the grid forming the city, while the immigrant population can only appear in a subset _{0} of cells. In those cells the immigrants are also distributed uniformly and randomly. By tuning the number of immigrant users and _{0}, one can explore how the metric _{l,c} reacts to finite numbers (see _{0}, they are indistinguishable from the local population and thus the ratio _{l,c} starts at one. As the number of immigrant users exceeds _{0}, the fact that their residence is restricted to a certain area of the city becomes evident and _{l,c} decays towards a fixed value. As can be seen in the inset of _{0}. 
The curves showing _{l,c} as a function of the number of immigrants collapse by considering them as a function of such ratio. In general terms, the metric _{l,c} reaches a stable value once the number of immigrants is between 10 and 20 times larger than the number of cells where the community concentrates _{0}. This model is a worst-case scenario for testing _{l,c}, since the immigrants distribute uniformly, while in more realistic applications, if a ghetto exists, the concentration density will not be uniform. In this latter case, a lower number of users is required to measure the stable value of _{l,c}.
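The qualitative behavior of this null model can be reproduced with a short simulation. The sketch below is an illustration under stated assumptions, not our actual implementation: it uses the plain Shannon entropy of the cell occupation (without the cell-area correction), it compares the immigrant sample with a local sample of the same size, and it reports the ratio of the mean entropies over repeated draws. The names `shannon_entropy` and `entropy_ratio` and all default parameters are hypothetical.

```python
import math
import random
from collections import Counter


def shannon_entropy(cells):
    """Plain Shannon entropy of a list of cell identifiers."""
    n = len(cells)
    return -sum((k / n) * math.log(k / n) for k in Counter(cells).values())


def entropy_ratio(n_users, n0, n_cells=10_000, trials=200, seed=0):
    """Ratio of mean entropies in the null model.

    Immigrants are placed uniformly at random in a fixed subset of
    `n0` cells; the same number of local users is placed uniformly
    over all `n_cells` cells. Requires n_users >= 2.
    """
    if n_users < 2:
        raise ValueError("need at least two users to measure an entropy")
    rng = random.Random(seed)
    allowed = rng.sample(range(n_cells), n0)  # cells open to immigrants
    s_immigrant = 0.0
    s_local = 0.0
    for _ in range(trials):
        immigrants = [rng.choice(allowed) for _ in range(n_users)]
        local_users = [rng.randrange(n_cells) for _ in range(n_users)]
        s_immigrant += shannon_entropy(immigrants)
        s_local += shannon_entropy(local_users)
    return s_immigrant / s_local
```

Under these assumptions, the ratio stays close to one while the number of immigrant users is well below the number of allowed cells, and it decays once the users outnumber those cells, in line with the behavior described above.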

Finally, we have also been able to compare the spatial distribution of the communities detected in three cities for which data from census offices were available. These cities are Barcelona, London and Madrid, and for the comparison we use data from the so-called Continuous Register Statistics in Spain and the Census Office in the UK. In the Spanish case, the information is collected when people residing in a certain area inform the municipal authorities for tax purposes and to obtain social services such as health care. The smallest spatial units for this dataset are census tracts, so Twitter data must be translated into the same geographical units (see the Supporting Information for further details). We employ the Anselin Local Moran’s

Global Moran’s

City | Language | Moran’s I | Z-value | Autocorrelation |
---|---|---|---|---|
Barcelona | Total | 0.63 | 236.5 | Positive |
Barcelona | Spanish | 0.62 | 217.0 | Positive |
Barcelona | English | 0.50 | 230.5 | Positive |
Barcelona | French | 0.37 | 151.5 | Positive |
Barcelona | Italian | 0.28 | 125.8 | Positive |
Barcelona | Portuguese | 0.32 | 151.2 | Positive |
Barcelona | Arabic | 0.08 | 89.9 | Random |
Barcelona | East-Slavic | 0.21 | 112.8 | Positive |
London | Total | 0.71 | 66.5 | Positive |
London | English | 0.34 | 35.9 | Positive |
London | Spanish | 0.27 | 28.1 | Positive |
London | French | 0.25 | 32.1 | Positive |
London | Italian | 0.26 | 31.9 | Positive |
London | Portuguese | 0.15 | 18.5 | Positive |
London | Arabic | 0.34 | 48.5 | Positive |
Madrid | Total | 0.62 | 268.6 | Positive |
Madrid | Spanish | 0.62 | 267.3 | Positive |
Madrid | English | 0.32 | 159.2 | Positive |
Madrid | French | 0.37 | 151.5 | Positive |
Madrid | Italian | 0.26 | 146.3 | Positive |
Madrid | Portuguese | 0.44 | 204.9 | Positive |
Madrid | Arabic | 0.07 | 41.5 | Random |
Madrid | East-Slavic | 0.06 | 37.7 | Random |
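For readers unfamiliar with the statistic, the following sketch computes the standard global Moran’s I for values arranged on a regular grid. It is an illustration only: the analysis above works on census tracts, whereas this example assumes a rectangular grid with binary rook (edge-sharing) neighbor weights, and it does not compute the Z-value. The function name `global_morans_i` is hypothetical.

```python
def global_morans_i(grid):
    """Global Moran's I for a rectangular grid of numbers.

    Neighbors are the cells sharing an edge (rook contiguity) with
    weight 1; this weighting scheme is an assumption of the sketch.
    """
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(sum(row) for row in grid) / n
    dev = [[value - mean for value in row] for row in grid]
    denom = sum(d * d for row in dev for d in row)
    if denom == 0:
        raise ValueError("Moran's I is undefined for constant values")
    num = 0.0
    w_total = 0  # sum of all neighbor weights
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                a, b = i + di, j + dj
                if 0 <= a < rows and 0 <= b < cols:
                    num += dev[i][j] * dev[a][b]
                    w_total += 1
    return (n / w_total) * (num / denom)
```

Values near zero indicate a spatially random pattern, positive values indicate that similar values cluster together, and negative values indicate that neighboring cells tend to differ, as in a checkerboard.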

Once the limits of the data and the method to assess the spatial segregation levels of foreign communities have been checked, we can study how the cities integrate the foreign groups detected in Twitter. To this end, and starting from the bipartite spatial integration network, we perform a clustering analysis based on the distribution of edge weights _{l,c}. For each city _{max}, namely, for London. We then perform a clustering analysis to find cities exhibiting similar distributions of edge weights by using a k-means algorithm based on Euclidean distances. The results of the analysis are confirmed by repeating the clustering detection with a Hierarchical Clustering Algorithm, which yields the same results (see Fig B in the
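A k-means clustering with Euclidean distances can be sketched in a few lines. The example below is a simplified illustration and not the code used for the analysis: the encoding of each city as a vector of edge weights sorted in decreasing order and padded with zeros to a common length is an assumption of this sketch, as is the naive initialization with the first k vectors.

```python
import math


def pad_sorted(weights, length):
    """Sort weights in decreasing order and pad with zeros up to
    `length` (a hypothetical encoding assumed for this sketch)."""
    ordered = sorted(weights, reverse=True)
    return ordered + [0.0] * (length - len(ordered))


def kmeans(vectors, k, iterations=100):
    """Plain k-means with Euclidean distance.

    Centroids are initialized with the first k vectors, a naive
    choice that keeps the example deterministic.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(iterations):
        # Assign every vector to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

In practice a more careful initialization and several restarts would be advisable, since k-means can converge to different partitions depending on the starting centroids.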

_{c} is the number of languages spoken in city _{max} is the maximum number of languages across the whole set of cities, _{c} is maximum when the median of the entropy ratio distribution is one or over, _{2} = 0, or when the _{c} ranging from Tokyo’s 0.41 to London’s 0.79; the former city shows good integration of massive communities coming from South Korea, the Philippines and China. On the other side, the British capital shows almost full spatial mixing of a very large number of foreign communities. Cities belonging to cluster C2 are characterized by values of _{c} ranging from Jakarta’s 0.10 (where segregation behaviors mix with the spatial uniformity of most of the communities) to the 0.37 reached in the urban area of Philadelphia; here we find several communities that are uniformly spread within the city, whereas segregation mainly affects the Arabic-speaking community. The cluster as a whole shows the first signs of segregation in a scenario where several communities are involved in the process. Finally, cluster C3 corresponds to cities where a low number of immigrant communities are not uniformly distributed within the urban areas, as shown by the very low values of _{c}. Brussels’s 0.01 is due to the low values of entropy of the Turkish community within a scenario of few immigrant communities. Toronto, on the other side, is characterized by a very high number of immigrant communities (comparable to cities found in cluster C2) that are not well spatially integrated within the urban environment. This leads to a _{c} value of 0.12. Note that the clusters are obtained directly from the similarity between vectors _{l,c} in the vectors and _{c}.

In (a), three groups of cities show similar behavior in the number of communities detected and in their levels of integration. The length of the vectors represents the number of languages (communities) detected in each city; the color scale is representative of the decay of the entropy metric; the _{l,c} for the cities in each cluster. The points correspond to the values of the elements of

The bipartite spatial integration network can also be projected onto the language side to gain insights on the level of integration of languages into the different countries (see Table H in the _{l,c} to build the network. The best and the worst cases of integration are displayed in

We select the sub-network representative of the best levels of spatial integration of languages in countries and display it on the left of the figure. The network is formed by the top 10% of links according to the entropy distribution (the spread of the values can be seen in boxplot (a) in comparison with all the values of _{l,c}). In addition, we include an extra 10% of links (dashed lines) in the network, those between the 10% and 20% best links (their spread is shown in boxplot (b)). In the network, only nodes that belong to the top set are highlighted. Similarly, on the right, the worst levels of spatial integration of languages in countries are shown. We select the bottom 10% of links according to the entropy distribution (their spread of values is shown in boxplot (c)), and add an extra 10% of links to the network (dashed lines), those between the 10% and 20% worst in the ranking. Their spread is shown in boxplot (d). As before, only the nodes that belong to the worst set are highlighted.

People are constantly moving within cities and countries, looking for jobs, experiences or simply better living conditions, and in doing so they face the challenge of integrating into the habits and laws of new local cultures. Migration flows have so far been studied by means of surveys and census data that cover aspects ranging from the number of people living outside their country of birth to their place of residence and features of the labor market. However, censuses and surveys have the disadvantages of a very high cost, geographical limitations and, typically, slow update frequencies. Recent works by experts in the area highlight the dire need for more agile data sources on the mobility and settlement patterns of immigrant and refugee communities.

Rather than using these classical sources, in this work we explore the capability of the online social networks to provide information about the integration of immigrant communities. In particular, we use Twitter to connect users to their residence place and via a language

This file includes 10 tables (Table A: Number of users and tweets in each city; Table B: Location of the city centers; C: Number of users residing in each city; D: Language aggregation process; E: Local languages in each city; F: Number of residents per language and city; G: Power of Integration of the cities; H: City-Country correspondence; I: Languages and country correspondence; J: Data validation) and 7 figures (A: Number of reliable users as a function of filtering parameters; B: Comparison of clustering methods for the cities; C: Degree and weight distribution of the bipartite networks with and without English; D: Data validation 1; E: Data validation 2; F: Relative error of the entropy as a function of the scale Δ


Partial financial support has been received from the Spanish Ministry of Economy (MINECO) and FEDER (EU) under the project ESOTECOS (FIS2015-63628-C2-2-R), and from the EU Commission through project INSIGHT (611307). The work of M-HS-O was supported in part by a post-doctoral fellowship of MINECO at Universidad Complutense de Madrid (FPDI 2013/17001). BG thanks the Moore and Sloan Foundations for support as part of the Moore-Sloan Data Science Environment at New York University.