The authors have declared that no competing interests exist.

From administrative registers of last names in Santiago, Chile, we create a surname affinity network that encodes socioeconomic data. This network is a multi-relational graph with nodes representing surnames and edges representing the prevalence of interactions between surnames by socioeconomic decile. We model the prediction of links as a knowledge base completion problem, and find that sharing neighbors is highly predictive of the formation of new links. Importantly, we distinguish between grounded neighbors and neighbors in the embedding space, and find that the latter are more predictive of tie formation. The paper discusses the implications of this finding in explaining the high levels of elite endogamy in Santiago.

In countries where citizens inherit both parents’ last names, surnames are a source to uncover the population structure of society. In previous research [

The method predicts links as a knowledge base completion problem. We evaluate several state-of-the-art knowledge base completion models and find that TuckER, which is based on a three-way tensor decomposition, obtains the best results.

The findings bear implications for the literature that examines the mechanisms behind group segregation. Traditional explanations focus on homophily, the propensity of people to connect to others of their own kind—the “birds of a feather” mechanism [

Our experiments show that triadic closure operates not only on the empirical network. We compute the embeddings of all nodes, and find that those that share a common neighbor in the

The substantive implication is that class divisions in Chile do not occur only because people prefer connecting and marrying others of their own socioeconomic status, but because they connect to others occupying similar locations in the network of interactions.

Along with the publication of this study, we release the surname affinity network SA19k (

We use the data introduced in Bro & Mendoza [

Individuals’ socioeconomic status was assigned based on the mean socioeconomic level of the blocks where they live. Socioeconomic status was transformed into a 0–100 range using the formula

We build a knowledge base that comprises paternal-maternal surname ties and income distributions based on the data described above. To create the knowledge base, we build an undirected network based on paternal-maternal surname affinities, where each node represents a surname. To create the edges, we count the number of individuals who share a pair of surnames, irrespective of the paternal-maternal or maternal-paternal order. Each pair of nodes was connected with ten relations, representing the deciles of the income distribution. The edge’s weight represents the number of individuals in each decile of the income distribution. This network has 76649 surnames and 1750802 edges.
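The edge construction described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `build_affinity_edges`, the record layout (paternal surname, maternal surname, decile), and the toy surnames are all assumptions made for the example.

```python
from collections import Counter


def build_affinity_edges(records):
    """Count individuals per (surname pair, income decile).

    `records` is a hypothetical iterable of (paternal, maternal, decile)
    tuples, one per individual, with decile an integer from 1 to 10.
    Returns a Counter keyed by (surname_a, decile, surname_b).
    """
    weights = Counter()
    for paternal, maternal, decile in records:
        # Sort the pair so that paternal-maternal and maternal-paternal
        # orderings fall on the same undirected edge.
        a, b = sorted((paternal, maternal))
        weights[(a, decile, b)] += 1
    return weights


# Toy records (invented for illustration only).
records = [
    ("Rojas", "Soto", 3),
    ("Soto", "Rojas", 3),
    ("Rojas", "Soto", 10),
    ("Errazuriz", "Larrain", 10),
]
edges = build_affinity_edges(records)
```

Each key is one relation-specific edge between two surnames, and its count is the edge weight for that decile, so a pair of surnames can carry up to ten weighted relations.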

Following Mateos _{ss} is less than a threshold _{s1} is the number of occurrences of the surname _{1}, _{s2} is the number of occurrences of the surname _{2}, and

Mateos

a) Even though many nodes have around 5–7 neighbors, the degree distribution shows the presence of hubs; b) Most of the affinity relationships occur in the highest income decile, which accounts for 29% of the edges of SA19k. In contrast, affinities in the lowest decile are almost nonexistent. The other income deciles are quite similar, with around 8% of the edges per decile.

SA19k is highly modular, with a modularity score of 0.802. The low-middle and high-income deciles are grouped into two modularity-based partitions. The blank nodes do not belong to either of them. The low-middle income deciles (in red) show fewer nodes and hubs than the rest of the network. The high-income deciles (in blue) show more nodes and hubs than the rest of the network. The size of each node represents its degree.

We study SA19k using predictive modeling. SA19k can be seen as a knowledge graph for surname affinity conditioned on income distribution. A predictive model based on SA19k can reveal unexpected connections between surnames. Our approach to predictive modeling is based on knowledge graph completion, where a model learns representations of nodes and edges to predict missing links. Link prediction reveals triplet-based facts, where each triplet represents a source and target surname, and each edge corresponds to a decile in the income distribution.

We model SA19k as a knowledge graph as follows. Let

Knowledge graph completion is the task of using

Machine learning-based approaches for knowledge graph completion use

During the prediction phase, for a given head entity

Similarly, the link prediction task can be defined for head entities, looking for the entity

To cover the broadest possible range of methods and architectures in the evaluation, we identified representative methods of different model families, taking care that these methods achieve state-of-the-art performances in knowledge graph completion and have open-source implementations that favor the reproducibility of the reported results. We identified three families of models:

Tensor decomposition models: These models are based on tensor decomposition techniques of the KG adjacency matrix, which is modeled as a three-dimensional tensor. The tensor representation consists of a set of stacked relationship adjacency matrices. These tensors are decomposed into a combination of low-dimensional vectors used as embeddings for the entities and relations of the KGs. Since tensor decompositions, in general, do not overfit the empirical KG, they are expected to have better generalization capabilities than other methods. We identified three methods of this family that show competitive results in benchmark data [
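The three-way tensor view described above, one adjacency matrix per relation stacked along a third axis, can be sketched with NumPy. The function name `adjacency_tensor`, the integer encoding of entities and relations, and the choice to store the tensor as binary and symmetric (because the affinity network is undirected) are assumptions for illustration.

```python
import numpy as np


def adjacency_tensor(triplets, n_entities, n_relations):
    """Stack one binary adjacency matrix per relation into a 3-way tensor.

    `triplets` holds (head, relation, tail) integer ids. The result has
    shape (n_relations, n_entities, n_entities).
    """
    X = np.zeros((n_relations, n_entities, n_entities), dtype=np.int8)
    for h, r, t in triplets:
        X[r, h, t] = 1
        X[r, t, h] = 1  # assumption: undirected edges, so mirror the entry
    return X


# Toy example: 3 surnames, 10 decile relations (ids 0..9).
triplets = [(0, 9, 1), (1, 2, 2), (0, 2, 1)]
X = adjacency_tensor(triplets, n_entities=3, n_relations=10)
```

A dense tensor is only practical for toy sizes; at the scale of a real knowledge graph the same structure would normally be kept as a list of triplets or a sparse format.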

Geometric models: These models interpret relationships as geometric transformations in the latent space of entity embeddings. They adjust the parameters of the node embeddings and of the geometric transformation function that represents each relation. Given a head embedding, the transformation in the latent space maps it to the tail embedding of the observed fact. The parameters of the embeddings and relationships are adjusted to minimize the distance in the latent space between the head and tail embeddings connected by empirical facts. Geometric models must satisfy constraints to ensure that the transformations in latent space are affine. We identified four geometric models representing this family: TransE [

Deep learning models: These models use layered neural network architectures to learn the embeddings of the KG entities and relationships. The embeddings are learned jointly with the weights and biases of each layer of the network, which makes these models more expressive but also more prone to overfitting. We identified three methods from this family of models that show competitive results in benchmark data [

For model selection, we partitioned SA19k into training, validation, and testing folds. We did this by sampling triplets at random. Both the validation and testing partitions have 5000 triplets each, leaving the training partition with 177563 triplets. The proportions between training/validation/testing triplets are typical in knowledge base completion, and very similar to those used to study predictability in the Wordnet database [
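A random partition of this kind can be sketched as below. The helper `split_triplets`, the fixed seed, and the toy triplets are assumptions for illustration; in the setting described above, the validation and testing sizes would each be 5000.

```python
import random


def split_triplets(triplets, n_valid, n_test, seed=0):
    """Shuffle triplets and cut them into train/validation/test folds."""
    shuffled = list(triplets)
    # A seeded generator keeps the partition reproducible.
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


# Toy knowledge graph: 1000 distinct (head, relation, tail) triplets.
toy = [(h, r, h + 1) for h in range(100) for r in range(10)]
train, valid, test = split_triplets(toy, n_valid=50, n_test=50)
```

Because every triplet lands in exactly one fold, no validation or test fact is seen during training. This sketch does not check that every entity in the held-out folds also appears in the training fold, which some evaluation protocols require.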

The best performing method in SA19K, TuckER [_{ijk}.

We introduce some definitions needed to define the Tucker decomposition properly. An ^{(i)} represents the

An essential operator of the Tucker decomposition is the _{1} × … × _{N−1} × _{N+1} … _{n}. An entry in

The Tucker decomposition factorizes a tensor

Balazevic

The scoring function for triplet facts is defined for each triplet 〈_{h}, _{r}, _{t}〉 using the element-wise tensor product per node:
_{h} and _{t} are rows in _{r} is a row in
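As an illustration of how a Tucker-style score is computed, the sketch below contracts a core tensor with a head embedding, a relation embedding, and a tail embedding, following the commonly published TuckER formulation. The symbol names (`W` for the core tensor, `E` and `R` for the entity and relation embedding matrices), the random initialization, and the helper names are assumptions for this example, not the authors' code. The dimensions 200 and 10 mirror the entity and relation sizes mentioned in the text.

```python
import numpy as np


def tucker_score(W, e_h, w_r, e_t):
    """Contract the core tensor W with head, relation, and tail vectors."""
    return np.einsum("ijk,i,j,k->", W, e_h, w_r, e_t)


def score_all_tails(W, e_h, w_r, E):
    """Score one (head, relation) pair against every candidate tail in E."""
    return np.einsum("ijk,i,j,nk->n", W, e_h, w_r, E)


def sigmoid(x):
    """Map raw scores to (0, 1) so they can be read as link probabilities."""
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
d_e, d_r, n_entities, n_relations = 200, 10, 5, 10
W = rng.normal(scale=0.01, size=(d_e, d_r, d_e))  # core tensor
E = rng.normal(size=(n_entities, d_e))            # one row per entity
R = rng.normal(size=(n_relations, d_r))           # one row per relation

scores = score_all_tails(W, E[0], R[3], E)
probs = sigmoid(scores)
```

Scoring one head-relation pair against all entities at once is what makes ranking every candidate tail cheap: the contraction over the head and relation is shared across candidates.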

For training, TuckER adds a reciprocal relation _{h}, _{r}, _{t}〉. The training step uses a 1: _{h}, _{r}, _{t}〉, TuckER scores

The best fit reported in _{r} = 10, which matches the number of relationships in SA19K. The dimensionality of the entity space was tested with values in {100, 200, 500, 1000}. The best result was obtained for _{e} = 200. TuckER considers three dropout factors. The first operates on the input, which corresponds to the tensor representation of the entities. A second dropout factor applies to the tensor product between the tensor representations of the relationships and the Tucker core tensor. Finally, a third dropout factor is applied when calculating the tensor product between the previous tensor and the tensor representation of the entities. This final product corresponds to the tensor representations of the pair 〈

Red and blue colors indicate high and low-value entries in the relation matrices, respectively. These matrices are computed using

For each triplet in the test partition, TuckER evaluates the pair 〈

We evaluate the ability of our model to forecast the emergence of new triangles. We consider three sources of information. The first source consists of the original network, whose triplets we call grounded triplets. If two unconnected nodes have many neighbors in common, as quantified by the shared nearest neighbors fraction (SNN), the model expects them to connect. SNN measures the number of neighbors two entities have in common as a fraction of the union of their neighborhoods. SNN is a very relevant measure in clustering analysis [

a) The fraction of SNN (average) varies depending on which source of information we consider. The most informative source is proximity-based SNN, although there is significant support based on grounded SNN from the original network. b) Grounded triplets per income decile. The fraction of grounded triplets is always less than half of the predicted triplets, except in decile 2, where it reaches almost 60% of the predictions.

Figures a and d show the connection of a larger cluster and a smaller one, exhibiting an absorption pattern. Figures b and c show the relationship of two clusters of similar sizes. In all the cases indicated, the new connection (marked with a red dotted line) creates a shorter path between both micro-clusters, decreasing the resulting graph’s diameter.

Most of the predicted triplets (57% on average across deciles) come from micro-clusters without direct connections like those shown in

The results show great predictability in the formation of ties between people holding certain surnames. Triadic closure is an empirical regularity well established in the social sciences [

Chile is one of the most unequal countries in the world [

The results of this study reveal the importance of the second mechanism, a dimension of socioeconomic segregation that is independent from homogamy: the subjective preference that people may have to connect with others like themselves [

Importantly, we make a distinction between empirical and more abstract forms of network proximity. Neighbors may come in the form of grounded triplets, but also as proximate points in the embedding space. We show that triadic closure occurs not only in the empirical network, but also in the embedding space. If nodes A and B are both similar to C, they have a good chance of forming a tie in the network. Link prediction tasks would therefore benefit from taking into account not only the empirical graph but also the embeddings of nodes.

This paper develops a model to predict the formation of new links in the network of surname affinity in Santiago, Chile (SA19k). It formulates the problem as a knowledge base completion task, and finds that the TuckER method produces the best results. The results show that the formation of affinity links between surnames in Santiago, Chile, is highly predictable. We find that proximity in the network explains a large proportion of new links. Importantly, the method distinguishes between proximity in the empirical network and proximity in the embedding space, and finds that the latter is more predictive of link formation than the former.

The results shed light on an unexplored dimension of socioeconomic segregation. People may or may not prefer connecting with and marrying others of the same socioeconomic status. However, given that the network of social interactions places constraints on their options, they are still likely to connect with or marry members of their in-group. Extensions of this paper could incorporate proximity in the embedding space as a parameter in statistical models that aim to disentangle the effects of different types of network proximity in explaining the social segregation that we often observe in empirical networks.

The results show the performance of the models in hits@{1,3,10}, and Mean Reciprocal Rank (MRR).


We thank Josué Tapia and Andrés Cruz for their help in the data-building phase of this project. We also thank Daniel Alcatruz, Sebastián Huneeus, and Johans Peña for their assistance.