Open-FF

Open-FF Data Dictionary


This file was generated on August 08, 2025
from data repository: openFF_data_2025_08_07.

Sponsored by FracTracker Alliance


Description of the contents of the final data files generated by Open-FF from the FracFocus data.

Acceptable use of FracFocus data

One requirement for using the FracFocus data is stipulated on the FracFocus website:

"Downloaded data may be aggregated or combined with other datasets, but the FracFocus data may not be altered in any way."

Please read the entire "Terms of use" at http://fracfocus.org/data-download.

This project maintains the original FracFocus data as reported in the bulk download. The original field names are kept; all of them begin with an upper-case letter and can be identified in that way. Fields generated by this project or taken from external data sources begin with a lower-case letter (for example, CASNumber is the original field and bgCAS is the generated field). There are two exceptions: DTXSID and MI_inconsistent are NOT original FracFocus fields.
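The naming convention above can be turned into a small check. The following is a minimal sketch, not part of the Open-FF code base; the function name `field_origin` and the constant `GENERATED_EXCEPTIONS` are hypothetical names chosen for illustration.

```python
# Fields that start with an upper-case letter but are NOT original
# FracFocus fields (the two exceptions named in the text).
GENERATED_EXCEPTIONS = {"DTXSID", "MI_inconsistent"}


def field_origin(field_name: str) -> str:
    """Classify a field by the naming convention.

    Returns 'original' for fields copied from FracFocus and 'generated'
    for fields created by Open-FF or pulled from external sources.
    (Hypothetical helper, for illustration only.)
    """
    if field_name in GENERATED_EXCEPTIONS:
        return "generated"
    # Original FracFocus names begin with an upper-case letter.
    return "original" if field_name[:1].isupper() else "generated"


for name in ["CASNumber", "bgCAS", "DTXSID", "MI_inconsistent"]:
    print(name, "->", field_origin(name))
```

Listing the exceptions explicitly keeps the rule readable and makes it easy to extend if more upper-case generated fields appear.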

In the zipped bulk download from FracFocus, a data dictionary is provided in the 'readme.txt' file. (This zipped download is in the /sources or /data directory, and we rename it 'currentData.zip'.) This file gives some information about many of the fields; however, it is written for the SQL database version of the bulk download, not the CSV version used in this project. Further, some important fields are not mentioned in readme.txt; they are described below. In the descriptions of all fields below, we cite the FracFocus text from a June 2021 bulk download.

Descriptions of fields in the output data sets

Explanation of columns in the table below
Column: What it is
fieldName: The name of the field or column in the data set. All field names that are capitalized are from the original FracFocus downloaded data. Lower-case names are generated by Open-FF.
tables: The Open-FF internal tables (used to construct the output data sets) that contain this field.
FracFocus description: Description of the (original) field given by FracFocus in the bulk download file, readme.txt.
Open-FF description: Our description of the field.
source: Whether the field is a direct copy of the original FracFocus data, generated by Open-FF, or pulled from an external data set.
Num: The number of non-empty values in the field.
Unique: The number of unique values (including NaN) in the field.
Data_type: The Python/pandas data type for the field.

Carrier detection sets:

Among the filters below, s1 finds the majority of water carriers. However, no single set of criteria can identify the water carrier record(s) for all FracFocus disclosures. The other filters are therefore used to catch many other disclosure patterns without needing to curate each disclosure by hand.

Set name: Description, followed by the criteria to be detected

s1: Primary filter; most recent disclosures are detected with this.
- Only one record whose Purpose is "carrier" (or related)
- bgCAS is '7732-18-5'
- at least 50% PercentHFJob
- total % of disclosure is 95% < x < 105%

s2: More than one record as the carrier; covers situations, for example, where there are two water records (fresh and produced) and where other chemicals are also labeled as part of the carrier. It is important to include all water carrier records to avoid underestimating carrier mass.
- More than one record whose Purpose is "carrier" (or related)
- at least one bgCAS is '7732-18-5'
- total of water records is at least 50% PercentHFJob
- total % of disclosure is 95% < x < 105%

s3: No carrier records labeled, but a clear water record with a typical percentage.
- bgCAS is '7732-18-5'
- at least 40% PercentHFJob
- IngredientName contains phrase "including mix water"
- total % of disclosure is 95% < x < 105%

s4: Like s3, but CAS number missing; still an obvious water record.
- CASNumber is empty
- at least 60% PercentHFJob
- IngredientName contains phrase "including mix water"
- total % of disclosure is 95% < x < 105%

s5: Like s1, but no carrier records are labeled.
- bgCAS is '7732-18-5'
- at least 50% PercentHFJob
- total % of disclosure is 95% < x < 105%

s6: CASNumber missing but clear carrier label.
- bgCAS is ambiguousID
- single record with a carrier Purpose
- IngredientName is either "carrier" (or related) or has "water" in it
- TradeName has "water" in it
- 50% < PercentHFJob < 100%
- total % of disclosure is 95% < x < 105%

s7: Like s1, but for "salted" water. Note that even though the record is labeled with the salt CAS number, the predominant mass is water.
- Only one record whose Purpose is "carrier" (or related)
- bgCAS is either '7447-40-7' (KCl) or '7647-14-5' (NaCl)
- at least 50% PercentHFJob
- total % of disclosure is 95% < x < 105%

s8: Common pattern in the older disclosures (incl. SkyTruth archive).
- bgCAS is ambiguousID or 7732-18-5
- IngredientName is MISSING
- Purpose is "unrecorded purpose"
- TradeName has either "water" or "brine"
- can be one or two records in each disclosure
- 50% < sum of PercentHFJob of these records < 100%
- total % of disclosure is 95% < x < 105%

s9: Common pattern in the older disclosures (incl. SkyTruth archive).
- bgCAS is ambiguousID or 7732-18-5
- IngredientName is MISSING
- Purpose is one of the standard carrier words or phrases
- TradeName has either "water" or "brine"
- can be one or two records in each disclosure
- 50% < sum of PercentHFJob of these records < 100%
- total % of disclosure is 95% < x < 105%

s10: A pattern seen in later disclosures: the carrier is only reported in the top part of the systems approach section under the "Listed Below" CASNumber. The actual PercentHFJob value isn't even reported in the PDF version, but is in the bulk download.
- CASNumber is "Listed Below"
- record has a carrier Purpose
- PercentHFJob > 50%
- TradeName has "water" in it
- total % of disclosure is 95% < x < 105%
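As an illustration of how such a set might be evaluated, here is a minimal sketch of the s1 criteria applied to the records of one disclosure. It is not the Open-FF implementation. It assumes each record is a dict with `Purpose`, `bgCAS`, and `PercentHFJob` keys, reads the total-percentage criterion as "strictly between 95% and 105%", and uses a hypothetical `CARRIER_WORDS` tuple as a stand-in for the project's actual list of carrier-related Purpose labels.

```python
WATER_CAS = "7732-18-5"
# Hypothetical stand-in for the project's carrier-related Purpose labels.
CARRIER_WORDS = ("carrier", "base fluid")


def is_carrier_purpose(purpose):
    """True if the Purpose text contains one of the assumed carrier words."""
    text = (purpose or "").lower()
    return any(word in text for word in CARRIER_WORDS)


def matches_s1(records):
    """Check the s1 criteria for the records of a single disclosure."""
    # Total percentage of the disclosure must fall between 95% and 105%.
    total = sum(r["PercentHFJob"] or 0 for r in records)
    if not (95 < total < 105):
        return False
    # Exactly one record may carry a carrier Purpose.
    carriers = [r for r in records if is_carrier_purpose(r["Purpose"])]
    if len(carriers) != 1:
        return False
    rec = carriers[0]
    # That record must be water and make up at least 50% of the job.
    return rec["bgCAS"] == WATER_CAS and (rec["PercentHFJob"] or 0) >= 50


disclosure = [
    {"Purpose": "Carrier", "bgCAS": "7732-18-5", "PercentHFJob": 88.0},
    {"Purpose": "Proppant", "bgCAS": "14808-60-7", "PercentHFJob": 11.0},
    {"Purpose": "Acid", "bgCAS": "7647-01-0", "PercentHFJob": 0.5},
]
print(matches_s1(disclosure))
```

The other sets could follow the same shape, each as a separate predicate, so that a disclosure is tested against them one at a time.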

Disclosures with detected problems for determination of water carrier ID

code description
0 Disclosure has no valid chemical records.
1 TotalBaseWaterVolume is empty or 0 gallons.
2 None of the chemical records have non-zero PercentHFJob.
3 The sum of PercentHFJob values for valid CAS records is larger than limit (105%).
4 The sum of PercentHFJob values for all records excluding SystemApproach is larger than limit (105%).
5 PercentHFJob of all "proppant" records is greater than 50% (not used after v16)
6 The sum of PercentHFJob values for all records is less than 90% - a partial disclosure
7 PercentHFJob of Nitrogen or Carbon Dioxide records is greater than 50% (so carrier will be smaller) (not used after v16)
8 PercentHFJob of Chlorine dioxide records is 100% (it is typically an additive to the water; not a replacement) (added 3/2023, after v16).
9 PercentHFJob of a non-water carrier record is too large (>50%) (added 3/2023, after v16).
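A few of these codes depend only on simple sums, so they can be sketched compactly. The example below covers codes 1, 2, 3, and 6 only and is an assumption-laden illustration, not the Open-FF code. The function name `problem_codes` and the per-record boolean key `has_valid_cas` are hypothetical, and empty values are assumed to be represented as `None`.

```python
UPPER_LIMIT = 105.0    # percent; the limit quoted for code 3
PARTIAL_LIMIT = 90.0   # percent; the threshold quoted for code 6


def problem_codes(total_base_water_volume, records):
    """Return the subset of problem codes (1, 2, 3, 6) that apply.

    `records` is a list of dicts with a 'PercentHFJob' value and a
    hypothetical 'has_valid_cas' flag marking valid CAS records.
    """
    codes = []
    # Code 1: TotalBaseWaterVolume is empty (None) or 0 gallons.
    if not total_base_water_volume:
        codes.append(1)
    percents = [r.get("PercentHFJob") or 0.0 for r in records]
    # Code 2: no record has a non-zero PercentHFJob.
    if not any(p > 0 for p in percents):
        codes.append(2)
    # Code 3: sum over valid CAS records exceeds the limit.
    valid_sum = sum(
        r.get("PercentHFJob") or 0.0 for r in records if r.get("has_valid_cas")
    )
    if valid_sum > UPPER_LIMIT:
        codes.append(3)
    # Code 6: sum over all records is below 90%, a partial disclosure.
    if sum(percents) < PARTIAL_LIMIT:
        codes.append(6)
    return codes


records = [
    {"PercentHFJob": 80.0, "has_valid_cas": True},
    {"PercentHFJob": 30.0, "has_valid_cas": True},
]
print(problem_codes(5000, records))
```

Returning a list rather than a single code allows one disclosure to carry several problems at once; whether Open-FF records them that way is not stated in this document.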