Agenda

  • What and Why
  • Small to medium imports
  • Large imports
  • Enormous imports
  • Postprocessing
  • Tips
  • Announcement
  • Q/A

What is Google Cloud Storage?


Why Google Cloud Storage?

Google Cloud Storage overview

  • Bucket/object model - up to 5 TiB/object
  • Immutable objects, strong (read-after-write) data consistency
  • Streaming uploads and resumable transfers, range read support
  • Multiple layers of redundancy. All data replicated to multiple data centers.
  • Serve static websites directly from Cloud Storage.
  • XML API: Interoperable with Amazon S3 and others.
  • OAuth 2.0 or interoperable authentication
  • Individual, project and group-level access controls. Signed URLs
  • New: Notifications, Versioning, JSON API, DRA storage, Composite objects, Cloud Console, 30% price drop


a moment to reflect


(image slides: files, knot, steps, interruption, freedom, integrate, scale, engineered, cloud)

Small to medium imports

Gigabytes

gsutil

gsutil cp (demo)

$ gsutil cp data* gs://iodemo/
Omitting directory "file://data". (Did you mean to do cp -R?)
Copying file://data.csv [Content-Type=text/csv]...

$ gsutil cp -R data* gs://iodemo/
Copying file://data/data1.csv [Content-Type=text/csv]...
Copying file://data/data4.csv [Content-Type=text/csv]...
Copying file://data/data2.csv [Content-Type=text/csv]...
Copying file://data/data3.csv [Content-Type=text/csv]...
Copying file://data.csv [Content-Type=text/csv]...

$ gsutil ls gs://iodemo/
gs://iodemo/data.csv
gs://iodemo/data/

How to get it

Gsutil provides a path abstraction

$ gsutil help
Usage: gsutil [-d][-D] [-h header]... [-m] [command [opts...] args...] [-q]
<...snip...>
  cat            Concatenate object content to stdout
  cp             Copy files and objects
  ls             List providers, buckets, or objects
  mv             Move/rename objects and/or subdirectories
  rm             Remove objects

... and a few reinterpretations

  mb             Make buckets
  rb             Remove buckets

Gsutil provides complete feature support

  chacl          Add or remove entries on bucket and/or object ACLs
  chdefacl       Add / remove entries on bucket default ACL
  compose        Concatenate a sequence of objects into a new composite object.
  config         Obtain credentials and create configuration file
  disablelogging Disable logging on buckets
  enablelogging  Enable logging on buckets
  getacl         Get ACL XML for a bucket or object
  getcors        Get a bucket's CORS XML document
  getdefacl      Get default ACL XML for a bucket
  getlogging     Get logging configuration for a bucket
  getversioning  Get the versioning configuration for one or more buckets
  getwebcfg      Get the website configuration for one or more buckets
  perfdiag       Run performance diagnostic
  setacl         Set bucket and/or object ACLs
  setcors        Set a CORS XML document for one or more buckets
  setdefacl      Set default ACL on buckets
  setmeta        Set metadata on already uploaded objects
  setversioning  Enable or suspend versioning for one or more buckets
  setwebcfg      Set a main page and/or error page for one or more buckets
  test           Run gsutil tests
  update         Update to the latest gsutil release
  version        Print version info about gsutil

For smaller datasets, gsutil cp is enough

$ # local to Cloud Storage
$ gsutil cp *.txt gs://my_bucket

$ # Cloud Storage to local
$ gsutil cp gs://my_bucket/*.txt .

$ # S3 to local
$ gsutil cp s3://my_bucket/*.txt .

$ # S3 to Cloud Storage
$ gsutil cp s3://my_bucket/*.txt gs://my_bucket

$ # Cloud Storage to S3
$ gsutil cp gs://my_bucket/*.txt s3://my_bucket

Limits

What are the main limits of gsutil cp?

  • Network latency

Large imports

10s of Gigabytes -> 10s of Terabytes

gsutil -m cp

multi-processing and multi-threading

Configuring -m

$ cat ~/.boto

<... snip ...>

      # 'parallel_process_count' and 'parallel_thread_count' 
      # specify the number of OS processes and Python threads, 
      # respectively, to use when executing operations in parallel.

      #parallel_process_count = 6
      #parallel_thread_count = 10
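
To experiment, uncomment and tune those two settings in ~/.boto, then rerun the copy with -m; the values and bucket name below are illustrative.

$ # in ~/.boto
$ #   parallel_process_count = 8
$ #   parallel_thread_count = 10
$ gsutil -m cp -R data gs://my_bucket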

Limits

What are the main limits of gsutil -m cp?

  • Network bandwidth
  • Disk I/O
  • Coordination / complexity

Problem

copying a single large file
isn't maxing out upstream bandwidth

Use object composition to improve throughput

To upload in parallel:

  • Split your file into smaller pieces
  • Upload them using "gsutil -m cp"
  • Compose the results
  • Delete the pieces

$ split -n 10 big-file big-file-part-
$ gsutil -m cp big-file-part-* gs://bucket/dir/
$ rm big-file-part-*
$ gsutil compose gs://bucket/dir/big-file-part-* gs://bucket/dir/big-file
$ gsutil -m rm gs://bucket/dir/big-file-part-*
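
A quick sanity check after composing, using the same hypothetical paths as above: compare the local file's size with the size gsutil ls -l reports for the composed object.

$ ls -l big-file
$ gsutil ls -l gs://bucket/dir/big-file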

Enormous imports

100s of Terabytes -> Petabytes

gsutil -m cp

(running on multiple machines)
5 Petabytes in 5 Weeks

Results

  • Over 5 Petabytes of data in approximately 4 billion objects
  • Live migration
  • Averaged over 10 Gbits/sec, with peaks above 20 Gbits/sec

Limits addressed

This solution addresses all the previous limits.

  • Network bandwidth
  • Disk I/O
  • Coordination / complexity

Postprocessing

Notifications and JSON batch requests

{
 "kind": "storage#object",
 "id": "<BucketName>/<ObjectName>/1364245936170000",
 "selfLink": "https://www.googleapis.com/storage/v1beta2/b/<BucketName>/o/<ObjectName>",
 "name": "<ObjectName>",
 "bucket": "<BucketName>",
 "generation": "1364245936170000",
 "metageneration": "1",
 "contentType": "application/octet-stream",
 "updated": "2013-03-25T21:12:16.168Z",
 "size": "5",
 "md5Hash": "0d599f0ec05c3bda8c3b8a68c32a1b47",
 "mediaLink": "https://www.googleapis.com/storage/v1beta2/b/<BucketName>/o/
               <ObjectName>?generation=1364245936170000&alt=media",
 "contentLanguage": "en",
 "owner": {
  "entity": "user-00b4902b93ceb90bd318a2723c8bd33ba238472dada82ec37701327065d2ba45",
  "entityId": "00b2903a17ceb30bd328a2723c4cd23ba638372dada82ec47702227035d2ca45"
 },
 "crc32c": 491145432
}

Notifications - example App Engine application

# Imports and the handler class added for context; the original slide showed
# only the post() method, and the class name here is illustrative.
import json
import logging

import webapp2


class NotificationHandler(webapp2.RequestHandler):

  def post(self):  # pylint: disable-msg=C6409
    """Process the notification event."""
    event_type = self.request.headers['X-Goog-Event-Type']
    if event_type == 'sync':
      logging.info('Sync message received.')
    else:
      an_object = json.loads(self.request.body)
      bucket = an_object['bucket']
      object_name = an_object['name']
      logging.info('%s/%s %s', bucket, object_name, event_type)

Problem

updating the ACL on 500,000 objects
or
deleting staging objects

JSON API batch requests

POST /batch HTTP/1.1
Content-Length: content_length
content-type: multipart/mixed; boundary="===============7330845974216740156=="
accept-encoding: gzip, deflate
authorization: Bearer oauth2_token


--===============7330845974216740156==
<... snip ...>
POST /storage/v1beta1/b/example-bucket/o/obj1/acl?alt=json HTTP/1.1
Content-Type: application/json
accept: application/json
content-length: 40

{"role": "READER", "entity": "allUsers"}
--===============7330845974216740156==
<... snip ...>
POST /storage/v1beta1/b/example-bucket/o/obj2/acl?alt=json HTTP/1.1
Content-Type: application/json
accept: application/json
content-length: 40

{"role": "READER", "entity": "allUsers"}
--===============7330845974216740156==

JSON API batch requests

from Queue import Queue  # Python 2 standard library

import httplib2
from apiclient.discovery import build

q = Queue()
# snipped: code to populate q with "jobs", each a list of object names to delete
# snipped: 'credentials' is an authorized OAuth 2.0 credentials object

def job_handler():
  # Create an httplib2.Http object to handle our HTTP requests
  http = httplib2.Http()
  http = credentials.authorize(http)

  # Build a service object for interacting with the API.
  service = build(serviceName='storage', version='v1beta1', http=http,
                  developerKey='{DEVELOPER_KEY}')

  # Each worker drains the shared queue, deleting one batch of objects per job.
  while not q.empty():
    objects = q.get()
    batch_remove(objects, service, http)

JSON API batch requests

import apiclient.http

def batch_remove(objects, service, http):
  # Callback invoked once per sub-request in the batch.
  def cb(req_id, response, exception):
    if exception:
      print req_id, exception

  batch = apiclient.http.BatchHttpRequest()

  # Add one objects.delete() call per object, then send them all in one HTTP request.
  for obj in objects:
    batch.add(service.objects().delete(bucket=BUCKET, object=obj), callback=cb)
  return batch.execute(http=http)


Full source code:
https://github.com/GoogleCloudPlatform/storage-bulk-delete-python

Tips

Tip

$ gsutil help
<... snip ...>
Additional help topics:
  acls           Working with Access Control Lists
  anon           Accessing Public Data without Credentials
  crc32c         CRC32C and Installing crcmod
  creds          Credential Types Supporting Various Use Cases
  dev            Contributing
  metadata       Working with Object Metadata
  naming         Object and Bucket Naming
  options        Top-Level Command-Line Options
  prod           Scripting Production Transfers
  projects       Working with Projects
  subdirs        How Subdirectories Work
  support        Google Cloud Storage Support
  versioning     Object Versioning and Concurrency Control
  wildcards      Wildcard Names

Tip

partition work by prefix

Partition with prefixes

$ ls
69c6bd8fa757d68324b1f2b768fd4db7
417040d5f4dde9c9b9ab0ae44510cb0e
e3a1deb4534b8bd7bf17216eda10f1bf
70b438f200c4f1baa0db43697f1efb37
d4943c49842cf1d73a49b9c4dd3cc441
359f4d243a902bef132d2b998abd7ca4
6f5a553d40440f42860bdcc43a4cf785
<... snip thousands of files ...>

$ gsutil -m cp 6* gs://my_bucket
$ gsutil -m cp 7* gs://my_bucket
$ gsutil -m cp 8* gs://my_bucket

When designing a process, choose useful object name prefixes.
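
One way to spread those prefix ranges across machines is plain ssh; the host names and paths below are hypothetical.

$ ssh worker1 'cd /data && gsutil -m cp [0-7]* gs://my_bucket'
$ ssh worker2 'cd /data && gsutil -m cp [89a-f]* gs://my_bucket'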

Problem

copying a specific list of files

from a text file or program output

gsutil cp -I

$ some_program | gsutil -m cp -I gs://my_bucket
$ cat files.txt 
README.md
notes.txt
images/disk_import.jpg

$ cat files.txt | gsutil -m cp -I gs://briandpe-scratch
Copying file://README.md [Content-Type=application/octet-stream]...
Copying file://notes.txt [Content-Type=text/plain]...
Copying file://images/disk_import.jpg [Content-Type=image/jpeg]...

Tip

run gsutil near your source or target
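
For example, when importing into Cloud Storage from S3 or another remote source, running the copy from a Compute Engine instance, or any machine close to the data, shortens the path; the host and bucket names below are hypothetical.

$ ssh my-gce-instance 'gsutil -m cp s3://my_bucket/*.txt gs://my_bucket'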

Announcement

Offline Disk Import Process

  • Format a SATA disk with encfs
  • Copy your data to the disk
  • Mail it to Google
  • Google imports it into a new bucket owned by the customer
  • Google mails the disk back
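
A rough sketch of preparing the disk for the first two steps; the device name, mount points, and exact procedure are assumptions here, and the authoritative instructions come with the preview.

$ sudo mkfs.ext4 /dev/sdb1                      # format the drive
$ sudo mount /dev/sdb1 /mnt/disk
$ sudo chown $USER /mnt/disk
$ mkdir /mnt/disk/.encrypted /mnt/disk/clear
$ encfs /mnt/disk/.encrypted /mnt/disk/clear    # prompts to set a passphrase
$ cp -r ~/dataset/. /mnt/disk/clear/            # copy data into the encrypted view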

Offline Disk Import Limited Preview

Recent feature releases

  • Versioning
  • Durable Reduced Availability storage
  • 30% price drop

  • Cloud Console
  • Composite objects
  • Notifications
  • JSON API
  • Offline disk import


Thank you!   Questions?