{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "PLwGNovEanAB" }, "source": [ "##### Copyright 2022 The TensorFlow Authors." ] }, { "cell_type": "markdown", "metadata": { "id": "fePXTHt_Izkk" }, "source": [] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cellView": "form", "execution": { "iopub.execute_input": "2024-07-19T09:53:39.602990Z", "iopub.status.busy": "2024-07-19T09:53:39.602707Z", "iopub.status.idle": "2024-07-19T09:53:39.607283Z", "shell.execute_reply": "2024-07-19T09:53:39.606575Z" }, "id": "jUK4QgfmbGPS" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "lKy2xzTnbsaJ" }, "source": [ "# Creating a custom Counterfactual Logit Pairing Dataset\n", "\n", "
\n", " \n", "\n", "\n", "\n", "
\n", " View on TensorFlow.org\n", "\n", " \n", " Run in Google Colab\n", "\n", " \n", " View source on GitHub\n", "\n", " Download notebook\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "N4wgS7SGbzhL" }, "source": [ "Applying Counterfactual Logit Pairing (CLP) to evaluate and improve the fairness of your model requires a counterfactual dataset. You create a counterfactual dataset by duplicating your existing dataset and changing the new dataset to add, remove, or modify identity terminology. This tutorial explains the approach and techniques for creating a counterfactual dataset for your existing text dataset.\n", "\n", "You use your counterfactual dataset with the CLP technique by creating a new data object, `CounterfactualPackedInputs`, that contains the `original_input` and `counterfactual_data`, and looks like the following:\n", "\n", "`CounterfactualPackedInputs` looks like the following:\n", "\n", "```python\n", "CounterfactualPackedInputs(\n", " original_input=(x, y, sample_weight),\n", " counterfactual_data=(original_x, counterfactual_x,\n", " counterfactual_sample_weight)\n", ")\n", "```\n", "\n", "The `original_input` should be the original dataset that is used to train your Keras model. `counterfactual_data` should be a `tf.data.Dataset` with the original `x` value, the corresponding `counterfactual_x` value, and the `counterfactual_sample_weight`. The `counterfactual_x` value is nearly identical to the original value but with one or more of the attributes removed or replaced. This dataset is used to pair the loss function between the original value and the counterfactual value with the goal of assuring that the model’s prediction doesn’t change when the sensitive attribute is different. `original_input` and `counterfactual_data` need to be the same shape. You can duplicate values from `counterfactual_data` so that it’s the same number of elements as `original_input`. \n", "\n", "Properties of `counterfactual_data`:\n", "* All `original_x` values need to have references to an identity group \n", "* Each `counterfactual_x` value is identical to the original value, but with one or more of the attributes removed or replaced\n", "* Have the same shape as original input (you can duplicate values so that they’re the same shape) \n", "\n", "`counterfactual_data` does not need to:\n", "* Have overlap with data within original input \n", "* Have ground truth labels \n", "\n", "Here’s an example of what a `counterfactual_data` would look like if you remove the term \"gay\".\n", "```python\n", "original_x: “I am a gay man”\n", "counterfactual_x: “I am a man” \n", "counterfactual_sample_weight”: 1\n", "```\n", "If you have a text classifier, you can use [`build_counterfactual_data`](https://www.tensorflow.org/responsible_ai/model_remediation/api_docs/python/model_remediation/counterfactual/keras/utils/build_counterfactual_data) to help create a counterfactual dataset. For all other data types, you need to provide a counterfactual dataset directly. 
\n" ] }, { "cell_type": "markdown", "metadata": { "id": "npFvpoI2cG9-" }, "source": [ "## Setup\n", "\n", "You'll begin by installing TensorFlow Model Remediation.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-07-19T09:53:39.611264Z", "iopub.status.busy": "2024-07-19T09:53:39.610668Z", "iopub.status.idle": "2024-07-19T09:53:41.082284Z", "shell.execute_reply": "2024-07-19T09:53:41.081468Z" }, "id": "8ou41oj9cSJd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting tensorflow-model-remediation\r\n", " Using cached tensorflow_model_remediation-0.1.7.1-py3-none-any.whl.metadata (4.8 kB)\r\n", "Collecting dill (from tensorflow-model-remediation)\r\n", " Using cached dill-0.3.8-py3-none-any.whl.metadata (10 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Collecting mock (from tensorflow-model-remediation)\r\n", " Using cached mock-5.1.0-py3-none-any.whl.metadata (3.0 kB)\r\n", "Requirement already satisfied: pandas in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-model-remediation) (2.2.2)\r\n", "Requirement already satisfied: tensorflow-hub in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-model-remediation) (0.16.1)\r\n", "Requirement already satisfied: tensorflow>=2.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-model-remediation) (2.17.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: absl-py>=1.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (2.1.0)\r\n", "Requirement already satisfied: astunparse>=1.6.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (1.6.3)\r\n", "Requirement already satisfied: flatbuffers>=24.3.25 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (24.3.25)\r\n", "Requirement already satisfied: gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (0.6.0)\r\n", "Requirement already satisfied: google-pasta>=0.1.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (0.2.0)\r\n", "Requirement already satisfied: h5py>=3.10.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (3.11.0)\r\n", "Requirement already satisfied: libclang>=13.0.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (18.1.1)\r\n", "Requirement already satisfied: ml-dtypes<0.5.0,>=0.3.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (0.4.0)\r\n", "Requirement already satisfied: opt-einsum>=2.3.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (3.3.0)\r\n", "Requirement already satisfied: packaging in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (24.1)\r\n", "Requirement already satisfied: protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (3.20.3)\r\n", "Requirement already satisfied: requests<3,>=2.21.0 in 
/tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (2.32.3)\r\n", "Requirement already satisfied: setuptools in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (71.0.3)\r\n", "Requirement already satisfied: six>=1.12.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (1.16.0)\r\n", "Requirement already satisfied: termcolor>=1.1.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (2.4.0)\r\n", "Requirement already satisfied: typing-extensions>=3.6.6 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (4.12.2)\r\n", "Requirement already satisfied: wrapt>=1.11.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (1.16.0)\r\n", "Requirement already satisfied: grpcio<2.0,>=1.24.3 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (1.65.1)\r\n", "Requirement already satisfied: tensorboard<2.18,>=2.17 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (2.17.0)\r\n", "Requirement already satisfied: keras>=3.2.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (3.4.1)\r\n", "Requirement already satisfied: tensorflow-io-gcs-filesystem>=0.23.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (0.37.1)\r\n", "Requirement already satisfied: numpy<2.0.0,>=1.23.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow>=2.0.0->tensorflow-model-remediation) (1.26.4)\r\n", "Requirement already satisfied: python-dateutil>=2.8.2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from pandas->tensorflow-model-remediation) (2.9.0.post0)\r\n", "Requirement already satisfied: pytz>=2020.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from pandas->tensorflow-model-remediation) (2024.1)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: tzdata>=2022.7 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from pandas->tensorflow-model-remediation) (2024.1)\r\n", "Requirement already satisfied: tf-keras>=2.14.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorflow-hub->tensorflow-model-remediation) (2.17.0)\r\n", "Requirement already satisfied: wheel<1.0,>=0.23.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from astunparse>=1.6.0->tensorflow>=2.0.0->tensorflow-model-remediation) (0.43.0)\r\n", "Requirement already satisfied: rich in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from keras>=3.2.0->tensorflow>=2.0.0->tensorflow-model-remediation) (13.7.1)\r\n", "Requirement already satisfied: namex in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from keras>=3.2.0->tensorflow>=2.0.0->tensorflow-model-remediation) (0.0.8)\r\n", "Requirement already satisfied: optree in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from keras>=3.2.0->tensorflow>=2.0.0->tensorflow-model-remediation) (0.12.1)\r\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorflow>=2.0.0->tensorflow-model-remediation) (3.3.2)\r\n", "Requirement already satisfied: 
idna<4,>=2.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorflow>=2.0.0->tensorflow-model-remediation) (3.7)\r\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorflow>=2.0.0->tensorflow-model-remediation) (2.2.2)\r\n", "Requirement already satisfied: certifi>=2017.4.17 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from requests<3,>=2.21.0->tensorflow>=2.0.0->tensorflow-model-remediation) (2024.7.4)\r\n", "Requirement already satisfied: markdown>=2.6.8 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.18,>=2.17->tensorflow>=2.0.0->tensorflow-model-remediation) (3.6)\r\n", "Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.18,>=2.17->tensorflow>=2.0.0->tensorflow-model-remediation) (0.7.2)\r\n", "Requirement already satisfied: werkzeug>=1.0.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from tensorboard<2.18,>=2.17->tensorflow>=2.0.0->tensorflow-model-remediation) (3.0.3)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: importlib-metadata>=4.4 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from markdown>=2.6.8->tensorboard<2.18,>=2.17->tensorflow>=2.0.0->tensorflow-model-remediation) (8.0.0)\r\n", "Requirement already satisfied: MarkupSafe>=2.1.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from werkzeug>=1.0.1->tensorboard<2.18,>=2.17->tensorflow>=2.0.0->tensorflow-model-remediation) (2.1.5)\r\n", "Requirement already satisfied: markdown-it-py>=2.2.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from rich->keras>=3.2.0->tensorflow>=2.0.0->tensorflow-model-remediation) (3.0.0)\r\n", "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from rich->keras>=3.2.0->tensorflow>=2.0.0->tensorflow-model-remediation) (2.18.0)\r\n", "Requirement already satisfied: zipp>=0.5 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard<2.18,>=2.17->tensorflow>=2.0.0->tensorflow-model-remediation) (3.19.2)\r\n", "Requirement already satisfied: mdurl~=0.1 in /tmpfs/src/tf_docs_env/lib/python3.9/site-packages (from markdown-it-py>=2.2.0->rich->keras>=3.2.0->tensorflow>=2.0.0->tensorflow-model-remediation) (0.1.2)\r\n", "Using cached tensorflow_model_remediation-0.1.7.1-py3-none-any.whl (142 kB)\r\n", "Using cached dill-0.3.8-py3-none-any.whl (116 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Using cached mock-5.1.0-py3-none-any.whl (30 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Installing collected packages: mock, dill, tensorflow-model-remediation\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Successfully installed dill-0.3.8 mock-5.1.0 tensorflow-model-remediation-0.1.7.1\r\n" ] } ], "source": [ "!pip install --upgrade tensorflow-model-remediation" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2024-07-19T09:53:41.086614Z", "iopub.status.busy": "2024-07-19T09:53:41.086340Z", "iopub.status.idle": "2024-07-19T09:53:43.673216Z", "shell.execute_reply": "2024-07-19T09:53:43.672442Z" }, "id": "w42tJVqpcTal" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-07-19 09:53:41.340953: E 
external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "2024-07-19 09:53:41.361880: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "2024-07-19 09:53:41.368395: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n" ] } ], "source": [ "import tensorflow as tf\n", "from tensorflow_model_remediation import counterfactual" ] }, { "cell_type": "markdown", "metadata": { "id": "2tpF3OEleEDr" }, "source": [ "## Create a Simple Dataset\n", "\n", "For demonstration purposes, we’ll create counterfactual data from the original input using `build_counterfactual_data`. Note that you can also construct counterfactual data from unlabeled data (as opposed to constructing it from the original input). You will create a simple dataset with one sentence, “I am a gay man”, which will serve as the `original_input`.\n", "\n", "Note: The dataset created in this tutorial is a simple list of repeated text for demonstration purposes only. Further, this tutorial only demonstrates the steps for creating a counterfactual dataset and does not represent a real-world use case.\n", "\n", "## Build a Counterfactual Dataset\n", "\n", "As this is a text classifier, you can create the counterfactual dataset with `build_counterfactual_data` in two ways:\n", "1. Remove terms: Pass `build_counterfactual_data` a list of words that will be removed from the dataset via `tf.strings.regex_replace`.\n", "2. Replace terms: Pass a custom function to `build_counterfactual_data`. This might include using more specific regex functions to replace words within your original dataset or to support non-text features.\n", "\n", "`build_counterfactual_data` takes in `original_input` and either removes or replaces terms depending on which optional parameters you pass. In most cases, removing terms (option 1) is sufficient to run CLP; however, passing a custom function (option 2) gives you more precise control over the counterfactual values.\n", "\n", "### Option 1: List of Words to Remove\n", "Pass `build_counterfactual_data` a list of identity terms to remove.\n", "\n", "When using simple regex to create the counterfactual dataset, keep in mind that this may alter words that shouldn’t be changed. It is good practice to check that the changes made to the `counterfactual_x` value make sense in the context of the `original_x` value. Additionally, `build_counterfactual_data` returns only the values that include a counterfactual instance. This could result in a dataset with a different shape from `original_input`, but it will be resized when passed to `pack_counterfactual_data`."
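, "\n", "\n", "For reference, once you have built `counterfactual_data` (as in the next cell), pairing it with your original dataset to produce the `CounterfactualPackedInputs` described above is roughly a single call. The snippet below is only a sketch: `original_input` stands in for your original `tf.data.Dataset`, and the exact signature is described in the `pack_counterfactual_data` API documentation.\n", "\n", "```python\n", "# Not run here: pair the original dataset with the counterfactual dataset.\n", "packed_data = counterfactual.keras.utils.pack_counterfactual_data(\n", "    original_input, counterfactual_data)\n", "```"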
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-07-19T09:53:43.677700Z", "iopub.status.busy": "2024-07-19T09:53:43.677252Z", "iopub.status.idle": "2024-07-19T09:53:46.121127Z", "shell.execute_reply": "2024-07-19T09:53:46.120206Z" }, "id": "xAPFGLy_fKSm" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Length of starting values: 20\n", "original: tf.Tensor(b'I am a gay man0', shape=(), dtype=string)\n", "counterfactual: tf.Tensor(b'I am a man0', shape=(), dtype=string)\n", "Length of dataset after build_counterfactual_data: 10\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", "I0000 00:00:1721382824.212840 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382824.216631 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382824.220305 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382824.225695 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382824.237536 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382824.240949 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382824.244491 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382824.247862 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382824.251348 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. 
See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382824.254836 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382824.258211 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382824.261555 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.510650 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.512802 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.514940 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.517172 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.519339 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.521281 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.523290 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.525275 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. 
See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.527328 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.529283 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.531324 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.533303 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.572780 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.574793 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.576887 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.578906 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.581059 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.582998 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.584991 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. 
See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.586968 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.588999 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.591422 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.593838 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n", "I0000 00:00:1721382825.596178 23039 cuda_executor.cc:1015] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355\n" ] } ], "source": [ "simple_dataset_x = tf.constant(\n", "    [\"I am a gay man\" + str(i) for i in range(10)] +\n", "    [\"I am a man\" + str(i) for i in range(10)])\n", "print(\"Length of starting values: \" + str(len(simple_dataset_x)))\n", "\n", "simple_dataset = tf.data.Dataset.from_tensor_slices(\n", "    (simple_dataset_x, None, None))\n", "\n", "counterfactual_data = counterfactual.keras.utils.build_counterfactual_data(\n", "    original_input=simple_dataset,\n", "    sensitive_terms_to_remove=['gay'])\n", "\n", "# Inspect the content of the TF Counterfactual Dataset\n", "for original_value, counterfactual_value, _ in counterfactual_data.take(1):\n", "  print(\"original: \", original_value)\n", "  print(\"counterfactual: \", counterfactual_value)\n", "print(\"Length of dataset after build_counterfactual_data: \" +\n", "      str(len(list(counterfactual_data))))" ] }, { "cell_type": "markdown", "metadata": { "id": "_ueC9K5qsvXH" }, "source": [ "### Option 2: Custom Function\n", "\n", "For more flexibility in how your original dataset is modified, you can instead pass a custom function to `build_counterfactual_data`.\n", "\n", "In this example, consider replacing identity terms that reference men with terms that reference women. You can do this by writing a function that replaces words according to a dictionary mapping, as shown in the code cell below.\n", "\n", "Note that the only limitation on the custom function is that it must be a callable that accepts and returns a tuple in the format used in [`Model.fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit), and that values without any changes should still be removed, which you can do by passing the relevant terms to `sensitive_terms_to_remove`."
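, "\n", "\n", "A custom function can also support non-text features. The following is a rough, hypothetical sketch of how such a callable might be structured; the dict-valued `x` and the `sensitive_attr` feature name are assumptions for illustration and are not part of this tutorial’s dataset. The actual replacement function for this tutorial’s text data is in the next cell.\n", "\n", "```python\n", "def neutralize_sensitive_feature(original_batch):\n", "  # Unpack (x, y, sample_weight); here x is assumed to be a dict of features.\n", "  original_x, _, original_sample_weight = (\n", "      tf.keras.utils.unpack_x_y_sample_weight(original_batch))\n", "  counterfactual_x = dict(original_x)\n", "  # Swap the hypothetical sensitive feature for a neutral default value.\n", "  counterfactual_x['sensitive_attr'] = tf.zeros_like(\n", "      original_x['sensitive_attr'])\n", "  return tf.keras.utils.pack_x_y_sample_weight(\n", "      original_x, counterfactual_x, sample_weight=original_sample_weight)\n", "```"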
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-07-19T09:53:46.125875Z", "iopub.status.busy": "2024-07-19T09:53:46.125600Z", "iopub.status.idle": "2024-07-19T09:53:46.261597Z", "shell.execute_reply": "2024-07-19T09:53:46.260773Z" }, "id": "L57yticErNJG" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Length of starting values: 20\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "original: tf.Tensor(b'I am a gay man0', shape=(), dtype=string)\n", "counterfactual: tf.Tensor(b'I am a gay man0', shape=(), dtype=string)\n", "Length of dataset after build_counterfactual_data: 10\n" ] } ], "source": [ "words_to_replace = {\"man\": \"woman\"}\n", "print(\"Length of starting values: \" + str(len(simple_dataset_x)))\n", "\n", "def replace_words(original_batch):\n", " original_x, _, original_sample_weight = (\n", " tf.keras.utils.unpack_x_y_sample_weight(original_batch))\n", " for word in words_to_replace:\n", " counterfactual_x = tf.strings.regex_replace(\n", " original_x, f'\b{word}\b', words_to_replace[word])\n", " return tf.keras.utils.pack_x_y_sample_weight(\n", " original_x, counterfactual_x, sample_weight=original_sample_weight)\n", "\n", "counterfactual_data = counterfactual.keras.utils.build_counterfactual_data(\n", " original_input=simple_dataset,\n", " sensitive_terms_to_remove=['gay'],\n", " custom_counterfactual_function=replace_words)\n", "\n", "# Inspect the content of the TF Counterfactual Dataset\n", "for original_value, counterfactual_value in counterfactual_data.take(1):\n", " print(\"original: \", original_value)\n", " print(\"counterfactual: \", counterfactual_value)\n", "print(\"Length of dataset after build_counterfactual_data: \" +\n", " str(len(list(counterfactual_data))))" ] }, { "cell_type": "markdown", "metadata": { "id": "GKOUgoE4Og76" }, "source": [ "To learn more, please see the API documents for [`build_counterfactual_data`](https://www.tensorflow.org/responsible_ai/model_remediation/api_docs/python/model_remediation/counterfactual/keras/utils/build_counterfactual_data)." ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "creating_a_custom_counterfactual_dataset.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 0 }