Agenda

  • What and Why
  • Small to medium imports
  • Large imports
  • Enormous imports
  • Postprocessing
  • Tips
  • Announcement
  • Q/A

What is Google Cloud Storage?


Why Google Cloud Storage?

Google Cloud Storage overview

  • Bucket/object model - up to 5 TiB/object
  • Immutable objects, strong (read-after-write) data consistency
  • Streaming uploads and resumable transfers, range read support
  • Multiple layers of redundancy. All data replicated to multiple data centers.
  • Serve static websites directly from Cloud Storage.
  • XML API: Interoperable with Amazon S3 and others.
  • OAuth 2.0 or interoperable authentication
  • Individual, project and group-level access controls. Signed URLs
  • New: Notifications, Versioning, JSON API, DRA storage, Composite objects, Cloud Console, 30% price drop


a moment to reflect


(image slides: files, knot, steps, interruption, freedom, integrate, scale, engineered, cloud)

Small to medium imports

Gigabytes

gsutil

gsutil cp (demo)

$ gsutil cp data* gs://iodemo/
Omitting directory "file://data". (Did you mean to do cp -R?)
Copying file://data.csv [Content-Type=text/csv]...

$ gsutil cp -R data* gs://iodemo/
Copying file://data/data1.csv [Content-Type=text/csv]...
Copying file://data/data4.csv [Content-Type=text/csv]...
Copying file://data/data2.csv [Content-Type=text/csv]...
Copying file://data/data3.csv [Content-Type=text/csv]...
Copying file://data.csv [Content-Type=text/csv]...

$ gsutil ls gs://iodemo/
gs://iodemo/data.csv
gs://iodemo/data/

How to get it

Gsutil provides a path abstraction

$ gsutil help
Usage: gsutil [-d][-D] [-h header]... [-m] [command [opts...] args...] [-q]
<...snip...>
  cat            Concatenate object content to stdout
  cp             Copy files and objects
  ls             List providers, buckets, or objects
  mv             Move/rename objects and/or subdirectories
  rm             Remove objects

... and a few reinterpretations

  mb             Make buckets
  rb             Remove buckets

Gsutil provides complete feature support

  chacl          Add or remove entries on bucket and/or object ACLs
  chdefacl       Add / remove entries on bucket default ACL
  compose        Concatenate a sequence of objects into a new composite object.
  config         Obtain credentials and create configuration file
  disablelogging Disable logging on buckets
  enablelogging  Enable logging on buckets
  getacl         Get ACL XML for a bucket or object
  getcors        Get a bucket's CORS XML document
  getdefacl      Get default ACL XML for a bucket
  getlogging     Get logging configuration for a bucket
  getversioning  Get the versioning configuration for one or more buckets
  getwebcfg      Get the website configuration for one or more buckets
  perfdiag       Run performance diagnostic
  setacl         Set bucket and/or object ACLs
  setcors        Set a CORS XML document for one or more buckets
  setdefacl      Set default ACL on buckets
  setmeta        Set metadata on already uploaded objects
  setversioning  Enable or suspend versioning for one or more buckets
  setwebcfg      Set a main page and/or error page for one or more buckets
  test           Run gsutil tests
  update         Update to the latest gsutil release
  version        Print version info about gsutil

For smaller datasets, gsutil cp is enough

$ # local to Cloud Storage
$ gsutil cp *.txt gs://my_bucket

$ # Cloud Storage to local
$ gsutil cp gs://my_bucket/*.txt .

$ # S3 to local
$ gsutil cp s3://my_bucket/*.txt .

$ # S3 to Cloud Storage
$ gsutil cp s3://my_bucket/*.txt gs://my_bucket

$ # Cloud Storage to S3
$ gsutil cp gs://my_bucket/*.txt s3://my_bucket

Limits

What are the main limits of gsutil cp?

  • Network latency

Large imports

10s of Gigabytes -> 10s of Terabytes

gsutil -m cp

multi-processing and multi-threading

Configuring -m

$ cat ~/.boto

<... snip ...>

      # 'parallel_process_count' and 'parallel_thread_count' 
      # specify the number of OS processes and Python threads, 
      # respectively, to use when executing operations in parallel.

      #parallel_process_count = 6
      #parallel_thread_count = 10
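
To experiment, uncomment and tune those two settings in ~/.boto, then rerun the copy with -m; the values and bucket name below are illustrative.

$ # in ~/.boto
$ #   parallel_process_count = 8
$ #   parallel_thread_count = 10
$ gsutil -m cp -R data gs://my_bucket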

Limits

What are the main limits of gsutil -m cp?

  • Network bandwidth
  • Disk I/O
  • Coordination / complexity

Problem

copying a single large file
isn't maxing out upstream bandwidth

Use object composition to improve throughput

To upload in parallel:

  • Split your file into smaller pieces
  • Upload them using "gsutil -m cp"
  • Compose the results
  • Delete the pieces

$ split -n 10 big-file big-file-part-
$ gsutil -m cp big-file-part-* gs://bucket/dir/
$ rm big-file-part-*
$ gsutil compose gs://bucket/dir/big-file-part-* gs://bucket/dir/big-file
$ gsutil -m rm gs://bucket/dir/big-file-part-*
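
A quick sanity check after composing, using the same hypothetical paths as above: compare the local file's size with the size gsutil ls -l reports for the composed object.

$ ls -l big-file
$ gsutil ls -l gs://bucket/dir/big-file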

Enormous imports

100s of Terabytes -> Petabytes

gsutil -m cp

(running on multiple machines)
5 Petabytes in 5 Weeks

Results

  • Over 5 Petabytes of data in approximately 4 billion objects
  • Live migration
  • Averaged over 10 Gbits/sec, with peaks above 20 Gbits/sec

Limits addressed

This solution addresses all the previous limits.

  • Network bandwidth
  • Disk I/O
  • Coordination / complexity

Postprocessing

Notifications and JSON batch requests

{
 "kind": "storage#object",
 "id": "<BucketName>/<ObjectName>/1364245936170000",
 "selfLink": "https://www.googleapis.com/storage/v1beta2/b/<BucketName>/o/<ObjectName>",
 "name": "<ObjectName>",
 "bucket": "<BucketName>",
 "generation": "1364245936170000",
 "metageneration": "1",
 "contentType": "application/octet-stream",
 "updated": "2013-03-25T21:12:16.168Z",
 "size": "5",
 "md5Hash": "0d599f0ec05c3bda8c3b8a68c32a1b47",
 "mediaLink": "https://www.googleapis.com/storage/v1beta2/b/<BucketName>/o/
               <ObjectName>?generation=1364245936170000&alt=media",
 "contentLanguage": "en",
 "owner": {
  "entity": "user-00b4902b93ceb90bd318a2723c8bd33ba238472dada82ec37701327065d2ba45",
  "entityId": "00b2903a17ceb30bd328a2723c4cd23ba638372dada82ec47702227035d2ca45"
 },
 "crc32c": 491145432
}

Notifications - example App Engine application

# Imports and the handler class added for context; the original slide showed
# only the post() method, and the class name here is illustrative.
import json
import logging

import webapp2


class NotificationHandler(webapp2.RequestHandler):

  def post(self):  # pylint: disable-msg=C6409
    """Process the notification event."""
    event_type = self.request.headers['X-Goog-Event-Type']
    if event_type == 'sync':
      logging.info('Sync message received.')
    else:
      an_object = json.loads(self.request.body)
      bucket = an_object['bucket']
      object_name = an_object['name']
      logging.info('%s/%s %s', bucket, object_name, event_type)

Problem

updating the ACL on 500,000 objects
or
deleting staging objects

JSON API batch requests

POST /batch HTTP/1.1
Content-Length: content_length
content-type: multipart/mixed; boundary="===============7330845974216740156=="
accept-encoding: gzip, deflate
authorization: Bearer oauth2_token


--===============7330845974216740156==
<... snip ...>
POST /storage/v1beta1/b/example-bucket/o/obj1/acl?alt=json HTTP/1.1
Content-Type: application/json
accept: application/json
content-length: 40

{"role": "READER", "entity": "allUsers"}
--===============7330845974216740156==
<... snip ...>
POST /storage/v1beta1/b/example-bucket/o/obj2/acl?alt=json HTTP/1.1
Content-Type: application/json
accept: application/json
content-length: 40

{"role": "READER", "entity": "allUsers"}
--===============7330845974216740156==

JSON API batch requests

from Queue import Queue  # Python 2 standard library

import httplib2
from apiclient.discovery import build

q = Queue()
# snipped: code to populate q with "jobs", each a list of object names to delete
# snipped: 'credentials' is an authorized OAuth 2.0 credentials object

def job_handler():
  # Create an httplib2.Http object to handle our HTTP requests
  http = httplib2.Http()
  http = credentials.authorize(http)

  # Build a service object for interacting with the API.
  service = build(serviceName='storage', version='v1beta1', http=http,
                  developerKey='{DEVELOPER_KEY}')

  # Each worker drains the shared queue, deleting one batch of objects per job.
  while not q.empty():
    objects = q.get()
    batch_remove(objects, service, http)

JSON API batch requests

import apiclient.http

def batch_remove(objects, service, http):
  # Callback invoked once per sub-request in the batch.
  def cb(req_id, response, exception):
    if exception:
      print req_id, exception

  batch = apiclient.http.BatchHttpRequest()

  # Add one objects.delete() call per object, then send them all in one HTTP request.
  for obj in objects:
    batch.add(service.objects().delete(bucket=BUCKET, object=obj), callback=cb)
  return batch.execute(http=http)


Full source code:
https://github.com/GoogleCloudPlatform/storage-bulk-delete-python

Tips

Tip

$ gsutil help
<... snip ...>
Additional help topics:
  acls           Working with Access Control Lists
  anon           Accessing Public Data without Credentials
  crc32c         CRC32C and Installing crcmod
  creds          Credential Types Supporting Various Use Cases
  dev            Contributing
  metadata       Working with Object Metadata
  naming         Object and Bucket Naming
  options        Top-Level Command-Line Options
  prod           Scripting Production Transfers
  projects       Working with Projects
  subdirs        How Subdirectories Work
  support        Google Cloud Storage Support
  versioning     Object Versioning and Concurrency Control
  wildcards      Wildcard Names

Tip

partition work by prefix

Partition with prefixes

$ ls
69c6bd8fa757d68324b1f2b768fd4db7
417040d5f4dde9c9b9ab0ae44510cb0e
e3a1deb4534b8bd7bf17216eda10f1bf
70b438f200c4f1baa0db43697f1efb37
d4943c49842cf1d73a49b9c4dd3cc441
359f4d243a902bef132d2b998abd7ca4
6f5a553d40440f42860bdcc43a4cf785
<... snip thousands of files ...>

$ gsutil -m cp 6* gs://my_bucket
$ gsutil -m cp 7* gs://my_bucket
$ gsutil -m cp 8* gs://my_bucket

When designing a process, choose useful object name prefixes.
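
One way to spread those prefix ranges across machines is plain ssh; the host names and paths below are hypothetical.

$ ssh worker1 'cd /data && gsutil -m cp [0-7]* gs://my_bucket'
$ ssh worker2 'cd /data && gsutil -m cp [89a-f]* gs://my_bucket'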

Problem

copying a specific list of files

from a text file or program output

gsutil cp -I

$ some_program | gsutil -m cp -I gs://my_bucket
$ cat files.txt 
README.md
notes.txt
images/disk_import.jpg

$ cat files.txt | gsutil -m cp -I gs://briandpe-scratch
Copying file://README.md [Content-Type=application/octet-stream]...
Copying file://notes.txt [Content-Type=text/plain]...
Copying file://images/disk_import.jpg [Content-Type=image/jpeg]...

Tip

run gsutil near your source or target
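
For example, when importing into Cloud Storage from S3 or another remote source, running the copy from a Compute Engine instance, or any machine close to the data, shortens the path; the host and bucket names below are hypothetical.

$ ssh my-gce-instance 'gsutil -m cp s3://my_bucket/*.txt gs://my_bucket'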

Announcement

Offline Disk Import Process

  • Format a SATA disk with encfs
  • Copy your data to the disk
  • Mail it to Google
  • Google imports it into a new bucket owned by the customer
  • Google mails the disk back
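
A rough sketch of preparing the disk for the first two steps; the device name, mount points, and exact procedure are assumptions here, and the authoritative instructions come with the preview.

$ sudo mkfs.ext4 /dev/sdb1                      # format the drive
$ sudo mount /dev/sdb1 /mnt/disk
$ sudo chown $USER /mnt/disk
$ mkdir /mnt/disk/.encrypted /mnt/disk/clear
$ encfs /mnt/disk/.encrypted /mnt/disk/clear    # prompts to set a passphrase
$ cp -r ~/dataset/. /mnt/disk/clear/            # copy data into the encrypted view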

Offline Disk Import Limited Preview

Recent feature releases

  • Versioning
  • Durable Reduced Availability storage
  • 30% price drop

  • Cloud Console
  • Composite objects
  • Notifications
  • JSON API
  • Offline disk import


Thank you!   Questions?