
The authors have declared that no competing interests exist.

We use a rich set of transaction data from a large retailer in India and a dataset of bribe payments to train random forest and XGBoost models on empirical measures guided by Benford’s Law, a commonly used tool in forensic analytics. We evaluate performance around the 2016 Indian Demonetization, which changed the set of legal tender notes in India, and find that models trained and tested within a single regime (using only pre-2016 data or only post-2016 data) achieve F1 scores around 90%, suggesting that these models and Benford’s Law criteria contain meaningful information for detecting bribe payments. However, the performance of models trained in one regime and tested in the other falls dramatically, to less than 10%, highlighting the role of the institutional setting when deploying financial data analytics in an environment subject to regime shifts.

Forensic analytics has been used extensively in various applications to detect irregularities in financial statements [

In this paper, we study whether Benford’s Law is a viable methodology for detecting bribe payments, and whether its performance depends on the economic regime, using a setting from India straddling the 2016 Demonetization event, which changed the paper currency notes that are legal tender. We consider the use of forensic analysis in combination with simple machine learning models to detect bribe payments in a dataset containing identified bribe and non-bribe payments. We hypothesize that bribe payments should be detectable using Benford’s Law, as bribe payments typically result from interpersonal negotiation and are typically made in cash and in round numbers. In particular, we study whether Benford’s Law estimates remain useful in a machine learning setup in the presence of regime shifts. The 2016 Indian demonetization of some legal tender notes is informative here: it allows us to compare how Benford’s Law performs in an out-of-sample analysis in which the legal setting of payment methods changes. We document three main findings.

First, bribe payments in India violate Benford’s Law, with a large deviation for payments beginning with “5”. This leading digit corresponds to the second-largest banknote of 500 rupees. We find that retail payments, both online and offline, satisfy Benford’s Law, consistent with findings from the previous literature, which we discuss below. However, after the 2016 demonetization, which ended the usage of 1,000-rupee banknotes and introduced the 200- and 2,000-rupee notes, the fit of bribe payments to Benford’s Law improved slightly.

Second, a random forest model using Benford’s Law estimates appears useful to detect bribe payments, performing slightly better than an XGBoost model based on F1 score. We also document that oversampling the data on bribes is important in improving the performance of the random forest model. However, the false-positive rate still suggests that an approach relying only on Benford’s Law estimates would not be practically feasible as analysts would still need to do manual investigations.

Third, we document that a model trained on pre-demonetization data but tested on post-demonetization data has an F1 score of only 12.5%, compared to over 93.6% for a model trained on post-demonetization data and tested on data sampled from the same period. Our findings highlight the importance of adjusting models to different legal settings. In particular, any follow-on research seeking to operationalize our findings should take country-specific settings into account. Where appropriate, researchers should model potential deviations from Benford’s Law in a way informed by the economic regime, such as the set of legal tender notes.

Relative to the existing research, we make two main contributions. First, we combine Benford’s Law with a machine learning algorithm and evaluate the predictive power of specific measures based on Benford’s Law. Second, we emphasize the importance of domain-specific knowledge when deploying forensic analyses, based on an economic regime shift whereby the Indian government changed the legal tender notes in India. The latter analysis shows the sensitivity of machine learning and forensic analytic methodologies in the face of changing regimes. In doing so, we highlight the importance of both feature engineering and domain knowledge in deploying these methodologies. An implication of our finding is that as payments become digitized, we expect these models to perform worse over time as bribe payments would not necessarily be clustered around legal tender combination amounts.

Versions of Benford’s law have been used across various fields, ranging from geophysical data analysis to studying election data. [

Our study also builds on the research using machine learning methods for forensic analytics. For example, [

To this end, our study builds upon the literature applying criteria from Benford’s law to machine learning models. Our approach to combine Benford’s Law with machine learning algorithms is most related to more recent literature combining forensic techniques with machine learning. For example, [

Empirical problems in fraud detection and forensic accounting are usually framed as classification problems: predicting a discrete class label for a given data observation. This work performs exploratory data analysis on datasets containing bribe and transactional data and aims to predict whether a given transaction is a bribe or not. The main challenge of this classification problem comes from the fact that, in real-world data, most transactions are not fraudulent. This results in an imbalanced dataset. Of the many ways to deal with an imbalanced dataset, we use the oversampling technique SMOTE.

Finally, as our goal is to emphasize the importance of understanding the institutional setting with which to deploy a machine learning model, we highlight the importance of training datasets for live deployment by considering all four possible combinations of training and testing data using the pre or post demonetization data. As a baseline, we show the performance for regime-matched training and testing data in

Panel A: Pre-Demonetization

| Model: | Random Forest | | | | XGBoost | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Accuracy | Precision | Recall | F1 Score | Accuracy | Precision | Recall | F1 Score |
| without SMOTE | 95.4 | 53.0 | 3.50 | 6.60 | 95.4 | 54.8 | 3.20 | 6.10 |
| with SMOTE | 91.0 | 98.1 | 85.8 | 91.6 | 91.0 | 86.0 | 98.0 | 91.6 |

Panel B: Post-Demonetization

| Model: | Random Forest | | | | XGBoost | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Accuracy | Precision | Recall | F1 Score | Accuracy | Precision | Recall | F1 Score |
| without SMOTE | 99.4 | 22.2 | 0.50 | 0.90 | 99.4 | NaN | 0.0 | NaN |
| with SMOTE | 93.2 | 88.5 | 99.4 | 93.6 | 93.2 | 88.5 | 99.4 | 93.6 |

Panel A: Trained on Pre-Demonetization and Tested on Post-Demonetization

| Model: | Random Forest | | | | XGBoost | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Accuracy | Precision | Recall | F1 Score | Accuracy | Precision | Recall | F1 Score |
| without SMOTE | 94.8 | 7.1 | 49.4 | 12.5 | 96.4 | 4.5 | 26.9 | 7.7 |
| with SMOTE | 83.4 | 4.0 | 91.2 | 7.7 | 83.1 | 3.0 | 91.6 | 5.8 |

Panel B: Trained on Post-Demonetization and Tested on Pre-Demonetization

| Model: | Random Forest | | | | XGBoost | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Accuracy | Precision | Recall | F1 Score | Accuracy | Precision | Recall | F1 Score |
| without SMOTE | 99.2 | 80.9 | 11.8 | 20.6 | 99.4 | 100.0 | 0.3 | 0.6 |
| with SMOTE | 87.9 | 5.3 | 89.3 | 10.0 | 87.2 | 3.8 | 88.2 | 7.2 |

The table above shows the performance of the random forest and XGBoost classifiers when the training and testing data come from different regimes, with and without SMOTE. Panel A shows models trained on data up to 2016 and tested on post-2016 data. Panel B shows models trained on post-2016 data and tested on data before and including 2016.

Benford’s Law, also known as the Law of First Digits or the Phenomenon of Significant Digits, states that rather than following a uniform distribution, the first digits of values in a dataset follow a specific logarithmic distribution: “1” should be most frequent, followed by “2”, “3”, and so on. The stability and universality of Benford’s Law have made it a mainstay tool in forensic analysis for detecting fraud and other irregularities, since manipulated numbers typically violate Benford’s Law. [
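The logarithmic first-digit distribution described above can be computed directly; a minimal sketch in Python:

```python
import math

# Expected first-digit frequencies under Benford's Law: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# "1" leads roughly 30.1% of the time, while "9" leads only about 4.6%
```

The nine probabilities sum to exactly one (the product of (d+1)/d over d = 1..9 telescopes to 10) and decrease monotonically, which produces the characteristic downward-sloping first-digit histogram.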

The Indian demonetization in November 2016 outlawed the use of the old 500- and 1,000-rupee notes. While new 500-rupee notes were reintroduced after demonetization, the 1,000-rupee notes were replaced by 2,000-rupee notes. If bribe transactions primarily take place in cash, a change in the denominations of cash will change the distribution of digits we can expect to see. It is thus important to examine the pre- and post-demonetization data independently.

However, the fact that bribe payments violate Benford’s Law does not guarantee that the violation has practical use for detecting bribes in a sample containing both legitimate and bribe payments. Therefore, we first evaluate whether bribe and non-bribe payments conform to Benford’s Law, and then evaluate whether features constructed from transaction values, guided by Benford’s Law criteria, are informative for predicting whether a transaction is a bribe payment.
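Conformity to Benford’s Law can be checked with a chi-square goodness-of-fit statistic on first-digit counts. The sketch below is illustrative only; the function names and the exact test used are our assumptions, not necessarily the paper’s procedure:

```python
import math
from collections import Counter

def first_digit(amount):
    """Leading (most significant) digit of an amount >= 1."""
    return int(str(int(amount))[0])

def benford_chi2(amounts):
    """Chi-square statistic of observed first-digit counts against
    Benford's Law. With 8 degrees of freedom, values above ~15.5
    reject conformity at the 5% level."""
    counts = Counter(first_digit(a) for a in amounts if a >= 1)
    n = sum(counts.values())
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat
```

A sample of round, banknote-like amounts (e.g. many payments of exactly 500) produces a very large statistic, while scale-spanning data such as successive powers of two conform closely.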

We combine two primary data sources for our analysis. The first comes from a proprietary data set from a large retailer in India, with online and offline point-of-sales information. The second comes from a self-reported website called

Our sample is from August 1, 2010, to October 31, 2019. Analyses are conducted at the transaction level, concatenating the two different datasets, where a bribe is coded as one, and a non-bribe is coded as zero. We include the day, month, and year in the dataset. We also construct additional features such as the day of the week. Since the data do not span many years, we do not use the year or month in any analyses.

Neither the bribe nor legitimate transaction data include personally identifiable information or demographic data for different parties in the transaction. Therefore, our analyses must make use of very little data, namely the transaction date and the transaction amount. Based only on these two raw data fields, we construct additional features below.

We construct measures of the distribution of numbers across different digits as the feature engineering exercise for the machine learning application, based on intuition guided by Benford’s Law. We use these constructed features in a random forest model. Then, because bribe payments are rare relative to normal transactions (less than 4% in our data), we train our machine learning model with synthetic minority oversampling (SMOTE) to ensure that the model’s node splits are informative for the portion of the data of interest: transactions similar to bribe payments. We discuss the variable construction and the machine learning process below.
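As an illustration of this kind of feature engineering, the sketch below derives digit-based features from a single transaction amount. The specific feature names are ours and the paper’s exact feature set may differ:

```python
def digit_features(amount):
    """Digit-based features for one transaction amount (illustrative)."""
    s = str(int(amount))
    feats = {
        "first_digit": int(s[0]),                     # Benford's Law focus
        "num_digits": len(s),                         # order of magnitude
        "is_multiple_of_500": int(amount % 500 == 0)  # round banknote amounts
    }
    # digit in each place-value unit: 1s, 10s, 100s, 1000s (0 if absent)
    for i, unit in enumerate(("1s", "10s", "100s", "1000s")):
        feats[f"digit_{unit}"] = int(s[-(i + 1)]) if len(s) > i else 0
    return feats
```

For example, `digit_features(500)` flags a round, banknote-style amount, while `digit_features(1234)` spreads information across all four place values.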

Using imbalanced data in a classifier will bias it towards the majority class. To counteract this imbalance, a data augmentation technique called the Synthetic Minority Oversampling Technique (SMOTE) can be used [
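The core SMOTE idea — synthesizing new minority-class points by interpolating between a minority sample and one of its nearest minority neighbors — can be sketched as follows. In practice one would use a library implementation such as `imblearn.over_sampling.SMOTE`; this simplified version is ours:

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Simplified SMOTE: each synthetic point lies on the segment between
    a random minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        dists = np.linalg.norm(X - X[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # exclude the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                      # interpolation weight in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.vstack(synthetic)
```

Because synthetic points are convex combinations of real minority points, they stay within the local geometry of the minority class rather than merely duplicating existing rows.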

The machine learning models are a random forest model from the

For all our analyses, we split the dataset by randomly drawing 75% of observations into the training set and placing the remaining 25% into the testing set. When the training and testing samples come from mismatched economic regimes, we sample data points such that 75% of the overall sample comes from one economic regime and 25% from the other.
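The 75/25 split and random forest fit can be sketched with scikit-learn as below; the synthetic features and labels are stand-ins for the paper’s proprietary data, and the hyperparameters shown are our assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for digit-based features with a rare (~4%) positive class
rng = np.random.default_rng(0)
X = rng.random((2000, 6))
y = (X[:, 0] > 0.96).astype(int)

# 75% training / 25% testing, stratified to preserve the class imbalance
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te))
```

The single-threshold rule generating `y` here is easy to learn, so the resulting F1 is high; real bribe data are far noisier, which is why the oversampling and regime analyses in the paper matter.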

However, there are some other subtle differences between the samples. Post-demonetization, we find a significant deviation around the digit 2, when the 200- and 2,000-rupee notes were introduced. These descriptive statistics show that the demonetization of 1,000-rupee bills and the introduction of 200- and 2,000-rupee bills changed the distribution of the first digit of bribe payment amounts.

In addition, the frequencies for payments beginning with 3, 4, 6, 7, 8, and 9 are below what Benford’s law predicts. Interestingly, we see similar but less extreme deviations in both the point-of-sales purchase amounts and the e-commerce amounts. The “total” line aggregates point-of-sales, e-commerce, and bribe payments together. Taken all together, this pattern fits most closely with Benford’s Law. These empirical patterns suggest that Benford’s Law has the potential to distinguish bribe payments from other kinds of normal transactions.

Although the distributional changes for the first digit show stark patterns, the use of the first-digit distribution for transaction-by-transaction classification is not obvious. To operationalize the distribution, we consider not just the first digit but the distribution of other digits as well. Importantly, we adopt a machine learning approach instead of a standard linear model because non-linearities in the distribution of digits across different units (1’s, 10’s, 100’s, 1,000’s) may matter in non-trivial ways that are informative for prediction. Therefore, although the distributions of first digits may look visually similar for the pre- and post-demonetization samples, the information content of those distributions depends on the economic regime. When the leading digits of different units enter the machine learning model as a non-linear combination of data points, we cannot evaluate ex ante whether the models will be highly sensitive to these subtle differences in the data, since we do not have a strong prior on how different digits should combine to indicate whether a payment is a bribe.

The

The table below shows the performance of the random forest and XGBoost classifiers on the pre- and post-demonetization data, respectively, with and without SMOTE. Panel A shows models trained and tested on pre-demonetization data, and Panel B shows models trained and tested on post-demonetization data. The random forest models take between 1 and 3 minutes to train, without and with SMOTE, and the XGBoost models take between 0.5 and 1.5 minutes. Unsurprisingly, the XGBoost algorithm is much faster to train. All of the models we consider can therefore be trained and deployed in practice with little time and computational cost.

The precision-recall curve summarizes the trade-off between the true positive rate (recall) and the positive predictive value (precision) for a predictive model across different probability thresholds. Precision-recall curves are useful when the observations are imbalanced between the two classes. A no-skill classifier cannot discriminate between the classes and predicts a random or constant class in all cases. The no-skill line depends on the ratio of positive to negative classes: it is a horizontal line at the proportion of positive cases in the dataset. For a balanced dataset, this value is 0.5. The two right subplots in
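To make the no-skill baseline concrete, the sketch below computes a precision-recall curve on toy imbalanced data with scikit-learn; the baseline equals the positive-class share, not 0.5:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy imbalanced labels: 5% positives, with scores that separate the classes
y_true = np.array([0] * 95 + [1] * 5)
rng = np.random.default_rng(1)
scores = rng.random(100) * 0.5 + y_true * 0.5  # positives score >= 0.5

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# No-skill baseline: the precision of a random classifier is the positive rate
no_skill = y_true.mean()  # 0.05 here, because the data are imbalanced
```

With perfectly separating scores, the curve reaches precision 1 at full recall, far above the 0.05 no-skill line.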

The subplots on the left show the model without SMOTE and those on the right use SMOTE to oversample the bribe payments. The subplots on the top train and test the model on pre-demonetization data and those on the bottom train and test the model on post-demonetization data. Panel A shows results for the random forest model and Panel B shows the results for the XGBoost model.

The receiver operating characteristic (ROC) curve plots the trade-off between the true positive rate and the false positive rate for a predictive model across different probability thresholds. When the area under the ROC curve (AUC) is 0.5, the classifier cannot distinguish between positive- and negative-class points; it is equivalent to predicting a coin-flip random class for every data point. When the AUC is 1, the classifier perfectly separates the positive and negative classes. The two right subplots in
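The AUC interpretation above can be verified directly: fully separable scores give an AUC of exactly 1, while uninformative scores hover near 0.5. A small sketch with scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0] * 95 + [1] * 5)
rng = np.random.default_rng(2)

separable = rng.random(100) * 0.5 + y_true * 0.5  # positives strictly higher
uninformative = rng.random(100)                   # scores unrelated to labels

auc_separable = roc_auc_score(y_true, separable)          # exactly 1.0
auc_uninformative = roc_auc_score(y_true, uninformative)  # near 0.5
```

Note that with only five positives, the "random" AUC has high sampling variance, which is one more reason precision-recall diagnostics complement ROC curves on imbalanced data.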

The subplots on the left show the model without SMOTE and those on the right use SMOTE to oversample the bribe payments. The subplots on the top train and test the model on pre-demonetization data and those on the bottom train and test the model on post-demonetization data. Panel A shows results for the random forest model and Panel B shows the results for the XGBoost model.

Overall, contrary to the hypothesis that Benford’s Law-type measures would not be informative for detecting bribe payments, we find evidence more consistent with the alternative hypothesis, that one can indeed use the distribution of digits to detect bribe payments in a highly imbalanced dataset.

In this next section, we study whether the models’ performance reflects the information in bribe payments that are not legitimate transactions and that depend on the economic regime. In particular, the regime that we evaluate is the different sets of legal tender notes. Therefore, we compare the performance of models trained in one regime but tested in another. To show our results’ robustness and emphasize the importance of economic regime alignment under which the models are trained and tested, we consider both cross-regime specifications. If we use the pre-demonetization data as the training set and the post-demonetization data as the testing set, we obtain the following results in Panel A of


This paper shows that forensic accounting techniques can be combined with machine learning methodologies to detect bribe payments in a mixed sample of bribes and normal transactions. In particular, given the extreme imbalance in the data, oversampling the bribe payments when training the model is important and improves recall by over ten-fold. However, bribe payments appear regime-specific and depend on the combination of legal tender notes in an economy. A model trained before the Indian demonetization in 2016 and used after the demonetization sees a large decrease in performance, between 10 and 30%, relative to models trained and tested in the same regime.

Our research shows the importance of domain- or regime-specific knowledge in using these procedures. Users should not simply launch a model and let it run indefinitely. Instead, analysts using such tools should consider the stability of their results across different settings and, if necessary, re-train the model on a new dataset when the underlying economic framework justifies it. Our results are a first step toward understanding how the underlying institutional setting of a payment system affects forensic analytic performance. Further research can take our results as a starting point for developing a diagnostic tool to detect shifts in the underlying data, which may inform users when a model is likely to be stale and in need of re-calibration. In addition, we only consider the use of Benford’s Law in detecting bribe payments. Follow-on work may also consider alternative forensic analysis tools and distributions.