
gsutil
$ gsutil cp data* gs://iodemo/
Omitting directory "file://data". (Did you mean to do cp -R?)
Copying file://data.csv [Content-Type=text/csv]...
$ gsutil cp -R data* gs://iodemo/
Copying file://data/data1.csv [Content-Type=text/csv]...
Copying file://data/data4.csv [Content-Type=text/csv]...
Copying file://data/data2.csv [Content-Type=text/csv]...
Copying file://data/data3.csv [Content-Type=text/csv]...
Copying file://data.csv [Content-Type=text/csv]...
$ gsutil ls gs://iodemo/
gs://iodemo/data.csv
gs://iodemo/data/
$ wget http://storage.googleapis.com/pub/gsutil.tar.gz
$ tar xfz gsutil.tar.gz -C $HOME
$ export PATH=${PATH}:$HOME/gsutil
$ pip install gsutil
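Whichever install method you use, gsutil needs credentials before it can reach Cloud Storage; the config command (listed in the help output below) walks through OAuth authorization and writes the ~/.boto file. A minimal sketch:

$ gsutil config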
$ gsutil help
Usage: gsutil [-d][-D] [-h header]... [-m] [command [opts...] args...] [-q]
<...snip...>
  cat    Concatenate object content to stdout
  cp     Copy files and objects
  ls     List providers, buckets, or objects
  mv     Move/rename objects and/or subdirectories
  rm     Remove objects
... and a few reinterpretations of familiar Unix commands:
  mb     Make buckets
  rb     Remove buckets
  chacl            Add or remove entries on bucket and/or object ACLs
  chdefacl         Add / remove entries on bucket default ACL
  compose          Concatenate a sequence of objects into a new composite object
  config           Obtain credentials and create configuration file
  disablelogging   Disable logging on buckets
  enablelogging    Enable logging on buckets
  getacl           Get ACL XML for a bucket or object
  getcors          Get a bucket's CORS XML document
  getdefacl        Get default ACL XML for a bucket
  getlogging       Get logging configuration for a bucket
  getversioning    Get the versioning configuration for one or more buckets
  getwebcfg        Get the website configuration for one or more buckets
  perfdiag         Run performance diagnostic
  setacl           Set bucket and/or object ACLs
  setcors          Set a CORS XML document for one or more buckets
  setdefacl        Set default ACL on buckets
  setmeta          Set metadata on already uploaded objects
  setversioning    Enable or suspend versioning for one or more buckets
  setwebcfg        Set a main page and/or error page for one or more buckets
  test             Run gsutil tests
  update           Update to the latest gsutil release
  version          Print version info about gsutil
$ # local to Cloud Storage
$ gsutil cp *.txt gs://my_bucket
$ # Cloud Storage to local
$ gsutil cp gs://my_bucket/*.txt .
$ # S3 to local
$ gsutil cp s3://my_bucket/*.txt .
$ # S3 to Cloud Storage
$ gsutil cp s3://my_bucket/*.txt gs://my_bucket
$ # Cloud Storage to S3
$ gsutil cp gs://my_bucket/*.txt s3://my_bucket
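For the s3:// forms to work, gsutil also needs AWS credentials, which boto reads from the same ~/.boto configuration file. A hedged sketch of the relevant section (the key values are placeholders, not real credentials):

[Credentials]
aws_access_key_id = <your AWS access key>
aws_secret_access_key = <your AWS secret key>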
What are the main limits of gsutil cp?
gsutil -m cp
$ cat ~/.boto
<... snip ...>
# 'parallel_process_count' and 'parallel_thread_count'
# specify the number of OS processes and Python threads,
# respectively, to use when executing operations in parallel.
#parallel_process_count = 6
#parallel_thread_count = 10
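To change the defaults, uncomment those two lines in ~/.boto and set the values before running a parallel copy. A hedged sketch, with illustrative values rather than recommendations:

$ grep '^parallel' ~/.boto
parallel_process_count = 8
parallel_thread_count = 4
$ gsutil -m cp *.txt gs://my_bucket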
What are the main limits of gsutil -m cp?
To upload in parallel:
$ split -n 10 big-file big-file-part-
$ gsutil -m cp big-file-part-* gs://bucket/dir/
$ rm big-file-part-*
$ gsutil compose gs://bucket/dir/big-file-part-* gs://bucket/dir/big-file
$ gsutil -m rm gs://bucket/dir/big-file-part-*
Combined with gsutil -m cp for the parallel upload of the parts, this compose-based approach addresses all the previous limits.
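Before relying on the reassembled object, a quick sanity check is cheap; a hedged sketch that compares the composite's size against the original (paths as in the example above):

$ ls -l big-file
$ gsutil ls -l gs://bucket/dir/big-file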
{ "kind": "storage#object", "id": "<BucketName>/<ObjectName>/1364245936170000", "selfLink": "https://www.googleapis.com/storage/v1beta2/b/<BucketName>/o/<ObjectName>", "name": "<ObjectName>", "bucket": "<BucketName>", "generation": "1364245936170000", "metageneration": "1", "contentType": "application/octet-stream", "updated": "2013-03-25T21:12:16.168Z", "size": "5", "md5Hash": "0d599f0ec05c3bda8c3b8a68c32a1b47", "mediaLink": "https://www.googleapis.com/storage/v1beta2/b/<BucketName>/o/ <ObjectName>?generation=1364245936170000&alt=media", "contentLanguage": "en", "owner": { "entity": "user-00b4902b93ceb90bd318a2723c8bd33ba238472dada82ec37701327065d2ba45", "entityId": "00b2903a17ceb30bd328a2723c4cd23ba638372dada82ec47702227035d2ca45" }, "crc32c": 491145432 }
def post(self):  # pylint: disable-msg=C6409
  """Process the notification event."""
  event_type = self.request.headers['X-Goog-Event-Type']
  if event_type == 'sync':
    logging.info('Sync message received.')
  else:
    an_object = json.loads(self.request.body)
    bucket = an_object['bucket']
    object_name = an_object['name']
    logging.info('%s/%s %s', bucket, object_name, event_type)
POST /batch HTTP/1.1
Content-Length: content_length
content-type: multipart/mixed; boundary="===============7330845974216740156=="
accept-encoding: gzip, deflate
authorization: Bearer oauth2_token

--===============7330845974216740156==
<... snip ...>
POST /storage/v1beta1/b/example-bucket/o/obj1/acl?alt=json HTTP/1.1
Content-Type: application/json
accept: application/json
content-length: 40

{"role": "READER", "entity": "allUsers"}
--===============7330845974216740156==
<... snip ...>
POST /storage/v1beta1/b/example-bucket/o/obj2/acl?alt=json HTTP/1.1
Content-Type: application/json
accept: application/json
content-length: 40

{"role": "READER", "entity": "allUsers"}
--===============7330845974216740156==
q = Queue()
# snipped: code to populate q with "jobs", each is a list of URLs to delete

def job_handler():
  # Create an httplib2.Http object to handle our HTTP requests
  http = httplib2.Http()
  http = credentials.authorize(http)

  # Build a service object for interacting with the API.
  service = build(serviceName='storage', version='v1beta1',
                  http=http, developerKey='{DEVELOPER_KEY}')

  while not q.empty():
    objects = q.get()
    batch_remove(objects, service, http)
def batch_remove(objects, service, http):
  def cb(req_id, response, exception):
    if exception:
      print req_id, exception

  batch = apiclient.http.BatchHttpRequest()
  for obj in objects:
    batch.add(service.objects().delete(bucket=BUCKET, object=obj),
              callback=cb)
  return batch.execute(http=http)
Full source code:
https://github.com/GoogleCloudPlatform/storage-bulk-delete-python
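When a full program is more than the job calls for, gsutil itself can delete objects in parallel; a hedged one-liner (the bucket is illustrative, and the ** wildcard matches every object in it):

$ gsutil -m rm gs://my_bucket/**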
$ gsutil help
<... snip ...>
Additional help topics:
  acls         Working with Access Control Lists
  anon         Accessing Public Data without Credentials
  crc32c       CRC32C and Installing crcmod
  creds        Credential Types Supporting Various Use Cases
  dev          Contributing
  metadata     Working with Object Metadata
  naming       Object and Bucket Naming
  options      Top-Level Command-Line Options
  prod         Scripting Production Transfers
  projects     Working with Projects
  subdirs      How Subdirectories Work
  support      Google Cloud Storage Support
  versioning   Object Versioning and Concurrency Control
  wildcards    Wildcard Names
$ ls
69c6bd8fa757d68324b1f2b768fd4db7  417040d5f4dde9c9b9ab0ae44510cb0e
e3a1deb4534b8bd7bf17216eda10f1bf  70b438f200c4f1baa0db43697f1efb37
d4943c49842cf1d73a49b9c4dd3cc441  359f4d243a902bef132d2b998abd7ca4
6f5a553d40440f42860bdcc43a4cf785
<... snip thousands of files ...>
$ gsutil -m cp 6* gs://my_bucket
$ gsutil -m cp 7* gs://my_bucket
$ gsutil -m cp 8* gs://my_bucket
When designing a process, choose useful object name prefixes.
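For example, with hash-like names such as those above, a hedged sketch that starts one copy job per leading hex digit and waits for them all (the prefix set and bucket are illustrative):

$ for prefix in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
>   gsutil -m cp "${prefix}"* gs://my_bucket &
> done
$ wait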
$ some_program | gsutil -m cp -I gs://my_bucket
$ cat files.txt
README.md
notes.txt
images/disk_import.jpg
$ cat files.txt | gsutil -m cp -I gs://briandpe-scratch
Copying file://README.md [Content-Type=application/octet-stream]...
Copying file://notes.txt [Content-Type=text/plain]...
Copying file://images/disk_import.jpg [Content-Type=image/jpeg]...
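The list does not have to live in a file; any program that prints one path per line works. A hedged sketch using find to upload only recently modified files (criteria and bucket are illustrative):

$ find . -type f -mtime -1 | gsutil -m cp -I gs://my_bucket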
Run gsutil near to your source or target.
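To compare candidate locations empirically, the perfdiag command from the command list above can estimate throughput from wherever gsutil runs; a hedged sketch (bucket name is illustrative):

$ gsutil perfdiag gs://my_bucket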
$ gsutil help