A Dataset of Late 1990s and Early 2000s Web Banner Ads on Chinese- and English-language Web Pages

This dataset contains information about 22,915 banner advertisement images appearing on Chinese- and English-language web pages in the late 1990s and early 2000s archived on the Wayback Machine. For each ad image, the dataset provides information about the image’s format and size, archived URLs of the image file, archived web pages the image appeared in

Online banner ads have been a subject of scholarly inquiry in a variety of fields since their inception almost three decades ago. While the vast majority of existing scholarly literature on banner ads focuses on user interaction with the ads, especially factors that may influence user engagement (Burke et al., 2005; Lohtia et al., 2003; Resnick & Albert, 2014), there is also a growing body of literature that examines the cultures and histories of banner advertising (Ankerson, 2018; Jessen, 2010; Li & Zhunag, 2007). However, to date no systematic dataset of historical banner ad images has been openly available to researchers. While museums, archives, advertising firms, and independent archivists have long been collecting advertisements in different mediums, few conventional archives and collections have systematically documented or preserved web banner ads. In his 2018 review of 179 archives and collections of advertising, advertising scholar Fred Beard found no established museum or university archive collecting digital advertisements. Among archives and collections maintained by advertisers, industry, and individual archivists, only nine included digital advertising (Beard, 2018). The only archive entirely dedicated to web advertising identified by Beard, Adverlicious,¹ is no longer accessible online.
In 2016, the Internet Archive launched GifCities, a search engine that allows users to search for GIF images in archived GeoCities web pages. A sizable number of the GIF files searchable on GifCities are banner ads, but GifCities only covers images appearing on GeoCities (jefferson, 2016). In 2018, independent archivist Tyler Grant released an archive of Flash-based banner ads that he manually downloaded from the Nielsen Ad Relevance database (Haskins, 2018). However, the archive does not contain any metadata about the downloaded Flash files, and it does not cover non-Flash banner ads. In addition, neither project has broad coverage of non-English banner ads.
In this paper, we present a dataset of web banner ads that grew out of a larger ongoing research project on Chinese-language web archiving. The aim of the original project is to measure and compare the archival rate and archival quality of Chinese- and English-language web pages from the late 1990s and early 2000s on the Wayback Machine. The project used printed Internet directory books published during that period to collect historical URLs. Before the advent of full-text search engines, printed directory books of Internet resources were popular among web users as a way to locate content of interest online (Ankerson, 2018). Formatted after phone books, the directory books usually provide lists of web page URLs manually curated into distinct categories. Today, the URLs featured in these books provide convenient entry points for researchers to access archived web content of the past. Using URLs found in six Internet directory books published in the United States and China in the late 1990s and early 2000s, we were able to access and download archived copies of web pages at these URLs on the Wayback Machine for the original research project. Since many of the archived web pages we downloaded contain banner ad images, we realised that it would be possible to build a dataset of historical banner ads on the web using the downloaded archived web pages and their metadata. We then devised a technical procedure to extract banner ad images from these archived web pages to compile this dataset.

The directory books (see Table 1) published in mainland China feature mostly URLs of Chinese-language web pages in both simplified and traditional Chinese and a small number of English-language web pages, while the English-language web directory books feature mostly URLs in English.
We used the Wayback Machine's CDX API (https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md) to get a list of the archived snapshots available on the Wayback Machine for each URL in our list. Next, we used a Selenium script to fetch all web page snapshots that were made before January 1, 2003 and had an HTTP response status code of 200 (indicating that the snapshot was successfully captured by the Wayback Machine). We downloaded these snapshots in MHTML format using the web browser Chromium. MHTML is a web archive format that saves the HTML source code of a web page along with its embedded resources in a single file. Image files on a web page are commonly stored as base64-encoded strings in an MHTML file (Hopmann et al., 1999). If a URL had more than 50 snapshots with an HTTP 200 status code, we randomly sampled 50 snapshots to download. In total, 1,384,355 MHTML files were downloaded from the Wayback Machine.
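The snapshot selection described above can be sketched roughly as follows. This is a minimal illustration rather than the script used for the dataset: the function names, the choice of returned fields, and the exact cutoff string are assumptions, and only query parameters documented in the CDX server README (`url`, `output`, `fl`, `filter`, `to`) are used.

```python
import json
import random
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
CUTOFF = "20030101000000"  # snapshots must be earlier than January 1, 2003


def build_cdx_query(url):
    """Build a CDX query for HTTP 200 snapshots of `url` captured before 2003."""
    params = {
        "url": url,
        "output": "json",
        "fl": "timestamp,original,statuscode",
        "filter": "statuscode:200",
        "to": "20021231235959",
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(text):
    """Parse the JSON output of the CDX server and drop its header row."""
    rows = json.loads(text) if text.strip() else []
    return rows[1:]


def sample_snapshots(rows, limit=50, seed=None):
    """Keep eligible rows and randomly sample `limit` of them if there are more."""
    eligible = [r for r in rows if r[2] == "200" and r[0] < CUTOFF]
    if len(eligible) <= limit:
        return eligible
    return random.Random(seed).sample(eligible, limit)
```

The status and date checks in `sample_snapshots` repeat what the query already requests; they are kept only as a defensive measure in this sketch.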
We then used a Python script to analyse each MHTML file and extract image files (in GIF, JPEG, PNG, and BMP formats) with dimensions commonly used in banner ad images. The dimensions we used in our script are from the voluntary guidelines for Internet advertisements first issued by the Internet Advertising Bureau (IAB) in 1996 (Collins, 1996). The IAB, whose name is now the Interactive Advertising Bureau, is a trade association in digital advertising whose members include major brands, websites, ad agencies, and technology providers. The IAB Internet ad size guidelines were produced as an attempt to standardise Internet ad sizes, and they played an influential role in shaping the Internet advertising landscape through standardisation, which accelerated the growth of the ad network industry (Lobato & Thomas, 2020). As we wanted to focus on conventional horizontal banner ads, we opted to use specific IAB dimensions in our image extraction process (IAB, 1996, 2003; see Table 2).
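A simplified sketch of this extraction step is given below. It is not the script used for the dataset: it relies on the fact that MHTML is a MIME multipart document and can therefore be read with Python's standard `email` package, it only reads dimensions from GIF and PNG headers, and the size list contains a single illustrative entry (468 × 60) rather than the full set of dimensions in Table 2. All function names are hypothetical.

```python
import email
import struct

# Illustrative only: the dimensions actually used are those listed in Table 2.
BANNER_SIZES = {(468, 60)}


def image_size(data):
    """Return (width, height) read from a GIF or PNG header, else None."""
    if data[:6] in (b"GIF87a", b"GIF89a") and len(data) >= 10:
        # GIF stores width and height as little-endian 16-bit integers.
        return struct.unpack("<HH", data[6:10])
    if data[:8] == b"\x89PNG\r\n\x1a\n" and len(data) >= 24:
        # PNG stores them as big-endian 32-bit integers in the IHDR chunk.
        return struct.unpack(">II", data[16:24])
    return None


def banner_images(mhtml_bytes, sizes=BANNER_SIZES):
    """Yield (content_location, size, image_bytes) for banner-sized images."""
    message = email.message_from_bytes(mhtml_bytes)
    for part in message.walk():
        if part.get_content_maintype() != "image":
            continue
        data = part.get_payload(decode=True)  # undoes the base64 encoding
        size = image_size(data) if data else None
        if size in sizes:
            yield part.get("Content-Location"), size, data
```

Handling JPEG and BMP would require additional header parsing or an imaging library, which is omitted here to keep the sketch short.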
We calculated each image file's MD5 hash to group duplicate images that appear in multiple downloaded web page snapshots. For each unique image, we logged its appearances across our entire downloaded dataset of MHTML files, and for each appearance, we logged the image's archived URL in the Wayback Machine, the original URL of the web page containing the image, the timestamp of the archived web page snapshot, and, if available, the archived URLs to which the image is linked. In total, 22,915 unique images were extracted from all downloaded MHTML files.
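The deduplication step can be illustrated with a short sketch. The function name and the shape of its input (pairs of image bytes and an appearance record) are assumptions made for illustration; only the use of MD5 as the grouping key comes from the description above.

```python
import hashlib
from collections import defaultdict


def group_by_md5(extracted):
    """Group appearance records by the MD5 hash of the image bytes.

    `extracted` is an iterable of (image_bytes, appearance) pairs, where
    `appearance` is any record describing where the image was found.
    """
    groups = defaultdict(list)
    for data, appearance in extracted:
        groups[hashlib.md5(data).hexdigest()].append(appearance)
    return dict(groups)
```

Each key of the returned dictionary then corresponds to one unique image, and its list holds every place that image was seen.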

OCR PROCESS
In addition to text, we included in our dataset the confidence level and bounding box data provided by the OCR engine. We did not manually check the extracted text data for accuracy because, in our experience, the existing data is already of reasonably high quality, and researchers can easily search for banners containing certain words using the OCR data.

QUALITY CONTROL
In the OCR process, we performed file integrity checks on all GIF ad images and marked corrupted images in the dataset. The Wayback Machine may archive a corrupted image either because a network error occurred when the Wayback Machine was trying to retrieve the image from the original server, or because the image was already corrupted on the original server and the Wayback Machine archived the corrupted image verbatim. Corrupted images were detected in the process of separating animated images into frames using the image processing software library ImageMagick. If ImageMagick failed to separate an image into individual frames and reported it as corrupt, we marked the image as corrupt in the dataset. No file integrity checks were performed on non-GIF images, because PaddleOCR was able to process and output OCR results from all non-GIF images.

DATA STRUCTURE
The dataset is presented as a JavaScript Object Notation (JSON) file containing an array of individual banner ad images. Each object in the array represents one unique banner ad image. Each object contains the following fields:
• md5: This field contains the MD5 hash value of the banner image file. It is used as a unique identifier for all banner ads in the dataset.
• width and height: These fields specify the dimensions of the banner ad in pixels.
• filetype: This field indicates the file format of the banner ad image as it was served from the original website (or the original ad provider's server) to the Wayback Machine's crawler. Possible values are gif, jpeg, bmp, and png. File type is detected by examining the first two characters of the image's base64 string.
• appearances: This is an array of objects. Each object represents one appearance of the banner ad in the downloaded collection of MHTML files along with associated details. An appearance is defined as the banner ad image located at a unique image_url (see below) appearing in a web page snapshot at a specific url archived at a specific timestamp (see below). Each object in this array contains the following fields:
    • url: This field provides the original URL of the web page where the banner ad was found.
    • timestamp: The timestamp indicates when the web page containing the banner ad was archived on the Wayback Machine. The timestamps are in the format of "YYYYMMDDHHMMSS". The archived snapshot of the web page containing the banner ad image can be accessed at https://web.archive.org/web/{{timestamp}}/{{url}}
    • image_url: This field provides the archived URL to the banner ad image as it appeared in the archived snapshot of the web page captured at the time indicated in timestamp.
    • hrefs: This field is an array containing archived URLs that the image would lead the user to upon clicking. In most cases, the array contains only one element. If this array contains multiple elements, it indicates that the banner image loaded from the same image_url appeared on the archived snapshot of the web page multiple times and was linked to at least two different URLs.
• ocr_result: If the image is not corrupted, ocr_result is an array containing text extracted from the image using PaddleOCR. For animated images, each object in this array represents one individual frame. For static images, there is only one object in this array, with frame_num (see below) being 0. For corrupted images, the value of ocr_result will be "corrupt". If the image is not corrupted, an object in this array contains the following fields:
    • frame_num: The number of the specific frame of the banner ad image that this object is representing (counting from zero).
    • result: An array representing bounding boxes detected by the OCR engine on the frame. A bounding box is a rectangular area in the image containing text as detected by the OCR engine. Each object in this array contains the following fields:
        • text: The text detected by the OCR engine.
        • confidence: The confidence score given by the OCR engine for the text detected. The value is between 0 and 1, with a higher score indicating a higher level of certainty the OCR engine has regarding the accuracy of the OCR results.
        • bounding_box: An array containing 4 sub-arrays representing the coordinates of the four corners of the bounding box.
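To make the field layout more concrete, the sketch below walks a small, invented sample that follows the structure described above. The sample values, the function names, and the assumption that ocr_result sits at the top level of each image object are illustrative and not taken from the dataset itself. The base64 prefixes in `filetype_from_base64` follow from the standard magic numbers of the four formats and are offered as one plausible reading of how a two-character check could work.

```python
import json

# Invented sample records; every value below is a placeholder.
SAMPLE_JSON = """
[
  {
    "md5": "00000000000000000000000000000001",
    "width": 468, "height": 60, "filetype": "gif",
    "appearances": [
      {
        "url": "http://www.example.com/",
        "timestamp": "19991231235959",
        "image_url": "https://web.archive.org/web/19991231235959im_/http://www.example.com/ad.gif",
        "hrefs": ["https://web.archive.org/web/19991231235959/http://www.example.com/click"]
      }
    ],
    "ocr_result": [
      {
        "frame_num": 0,
        "result": [
          {"text": "Click Here", "confidence": 0.98,
           "bounding_box": [[10, 5], [120, 5], [120, 40], [10, 40]]}
        ]
      }
    ]
  },
  {
    "md5": "00000000000000000000000000000002",
    "width": 468, "height": 60, "filetype": "gif",
    "appearances": [],
    "ocr_result": "corrupt"
  }
]
"""

# First two base64 characters implied by each format's magic number.
BASE64_PREFIXES = {"R0": "gif", "/9": "jpeg", "iV": "png", "Qk": "bmp"}


def filetype_from_base64(b64_string):
    """Guess the file type from the first two base64 characters."""
    return BASE64_PREFIXES.get(b64_string[:2])


def snapshot_url(appearance):
    """Build the archived page URL from an appearance's timestamp and url."""
    return f"https://web.archive.org/web/{appearance['timestamp']}/{appearance['url']}"


def ocr_texts(record):
    """Return all OCR text strings of a record; corrupted images yield none."""
    if record["ocr_result"] == "corrupt":
        return []
    return [box["text"] for frame in record["ocr_result"] for box in frame["result"]]


def search(records, keyword):
    """Return the md5 identifiers of records whose OCR text contains keyword."""
    needle = keyword.lower()
    return [r["md5"] for r in records if any(needle in t.lower() for t in ocr_texts(r))]


records = json.loads(SAMPLE_JSON)
```

With the sample loaded, `search(records, "click")` returns the identifier of the first record, and `snapshot_url(records[0]["appearances"][0])` reconstructs the address of the archived page in which that banner appeared.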
(3) DATASET DESCRIPTION

(4) REUSE POTENTIAL
We expect this dataset to be useful for researchers from a variety of disciplines who are interested in analysing banner ad images using different methods. By studying the evolution of banner ads in this dataset, researchers can gain insights into the changing aesthetics of online advertisements and the ways in which banner ads reflected the broader socio-cultural context of the web at the time. Since the banner ads are collected from Chinese- and English-language web pages, comparative studies of banner advertising along linguistic and cultural differences are also possible through analysis of the dataset.
Additionally, the dataset may be of interest to artists looking to create data-based artwork using the banner ads. As an example, we have created Banner Depot 2000, a website² where visitors can browse through the banner ads dataset, search for specific banners by keyword, and compose "found poetry" using individual frames of banner ad images in the dataset as poetry verses. Banners on the website are displayed on a fair use basis for the purpose of scholarly research and criticism, as well as transformative creative expression.

LIMITATIONS AND FUTURE WORK
We did not include the actual banner ad image files in our dataset due to copyright concerns. However, interested researchers should be able to download any image file manually using the value of the field image_url in any object under the appearances array in an ad image object in the dataset.
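As a rough illustration of how such a download might be scripted, the sketch below collects the distinct image_url values of an image object and tries each in turn. The function names are hypothetical, and the MD5 comparison assumes that the Wayback Machine returns the image bytes unmodified at image_url, which may not hold for every archived file.

```python
import hashlib
from urllib.request import urlopen


def image_urls(record):
    """Return the distinct image_url values of a record, in original order."""
    urls = []
    for appearance in record["appearances"]:
        if appearance["image_url"] not in urls:
            urls.append(appearance["image_url"])
    return urls


def matches_md5(data, record):
    """Check downloaded bytes against the record's md5 field."""
    return hashlib.md5(data).hexdigest() == record["md5"]


def fetch_image(record, timeout=30):
    """Try each archived URL until one returns bytes matching the md5 field."""
    for url in image_urls(record):
        try:
            with urlopen(url, timeout=timeout) as response:
                data = response.read()
        except OSError:  # covers URLError, HTTPError, and timeouts
            continue
        if matches_md5(data, record):
            return data
    return None
```

Callers who prefer to keep files whose hash differs could drop the `matches_md5` check and inspect the bytes themselves.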
Given the linguistic and thematic diversity of the banner ads, we decided not to provide any kind of subjective metadata for the ads. With the OCR data as well as other types of data provided in the dataset, interested researchers should be able to categorise and classify the banner ads in different ways that make sense for their own research projects. We are also considering building a tagging and collections feature into our website (see footnote 2) where users can add tags to banners and share customised collections of banner ads, thus enabling collective annotation and filtering of the dataset.
We are not providing manually corrected OCR data in this version of the dataset because, in our experience, the quality of the existing OCR data is already high enough to support keyword searching within the dataset, and additional manual error correction may provide only marginal benefits that would not justify the human labour required for the task. In future versions of the dataset, we will consider providing multiple versions of OCR data generated by different OCR engines in lieu of manually corrected data to help researchers who need more reliable text transcriptions of the banner ads.
Flash was a commonly used technology for displaying banner ads in the late 1990s and early 2000s. However, our dataset currently does not contain any Flash-based banner ads. This is because Chromium does not include Flash files in a web page when saving it to an MHTML file. We will consider developing a new mechanism to detect Flash ads in our collection of archived web pages and incorporate them into a future version of the dataset.
Due to technical limitations in our data curation methods, we are also unable to provide any metrics commonly used in the online advertising industry, such as impressions, click-through rates, and cost per click.
Since many banner ads are served by ad networks, which determine what ads are displayed for a user based on a variety of factors (such as the user's geographic location, the content of the web page, and the browser the user is using), the specific hardware, software, and network configurations of the Wayback Machine's crawler might have inadvertently played a role in determining which banner ads were captured.

To improve reusability of the dataset, we used the optical character recognition (OCR) software PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR) to extract text in all banner ad images. PaddleOCR is an open-source OCR engine developed by the Chinese search engine company Baidu. In our experience, PaddleOCR delivered more accurate results for Chinese characters than other open-source OCR software, such as Tesseract. The OCR model version used by PaddleOCR in our text recognition process is PP-OCRv4. For animated GIF banner ad images, we separated the image into individual static frames and extracted text from each frame of the image.

Table 1
Language: All variable names are in English. Most banner ads are in either English or Chinese (simplified and traditional).