Question or problem about Python programming:
Using boto3, I can access my AWS S3 bucket:
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket-name')
Now, the bucket contains the folder first-level, which itself contains several sub-folders named with a timestamp, for instance 1456753904534.
I need to know the names of these sub-folders for another job I’m doing, and I wonder whether I could have boto3 retrieve them for me.
So I tried:
objs = bucket.meta.client.list_objects(Bucket='my-bucket-name')
which gives a dictionary whose key ‘Contents’ gives me all the third-level files instead of the second-level timestamp directories; in fact I get a list of full object keys such as first-level/1456753904534/part-00014.
You can see that the individual files, in this case part-00014, are retrieved, while I’d like to get the name of the directory alone.
In principle I could strip the directory name out of all the paths, but it’s ugly and expensive to retrieve everything at the third level just to get the second level!
I also tried something reported here:
for o in bucket.objects.filter(Delimiter='/'):
    print(o.key)
but I do not get the folders at the desired level.
Is there a way to solve this?
How to solve the problem:
Solution 1:
S3 is an object store; it doesn’t have a real directory structure. The “/” separator is purely cosmetic.
One reason people want a directory structure is that they can maintain/prune/add a tree in the application. For S3, you treat such a structure as a sort of index or search tag.
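To make the “cosmetic /” point concrete, here is a minimal sketch (the bucket name and key are placeholders taken from the question): uploading an object whose key contains slashes does not create any directory objects; the key is stored as one flat string.

import boto3

s3 = boto3.client("s3")

# No "mkdir" happens here: the object is stored under a single flat key.
s3.put_object(Bucket="my-bucket-name",
              Key="first-level/1456753904534/part-00014",
              Body=b"example data")

# Listing by prefix returns the full flat key; there is no separate
# "first-level/" object unless you explicitly put one.
resp = s3.list_objects_v2(Bucket="my-bucket-name", Prefix="first-level/")
print([obj["Key"] for obj in resp.get("Contents", [])])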
To manipulate objects in S3, you need boto3.client or boto3.resource, e.g.
To list all objects:
import boto3

s3 = boto3.client("s3")
all_objects = s3.list_objects(Bucket='bucket-name')
http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects
In fact, if the S3 object names are stored using the ‘/’ separator, the more recent version of list_objects (list_objects_v2) allows you to limit the response to keys that begin with the specified prefix.
To limit the items to those under certain sub-folders:
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket=BUCKET,
    Prefix='DIR1/DIR2',
    MaxKeys=100)
Documentation
Another option is to use Python’s os.path functions to extract the folder prefix. The problem is that this requires listing objects from undesired directories.
import os

s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key)
foldername = os.path.dirname(s3_key)

# if you are not using the conventional delimiter, e.g. '#'
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]
A reminder about boto3: boto3.resource is a nice high-level API. There are pros and cons to using boto3.client vs boto3.resource. If you develop an internal shared library, using boto3.resource will give you a black-box layer over the resources used.
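As a small, hedged illustration of the two entry points (the bucket and prefix names are placeholders): the same listing can be done with either API, and the client that a resource wraps is always reachable via bucket.meta.client.

import boto3

# High-level resource API
bucket = boto3.resource("s3").Bucket("my-bucket-name")
for obj in bucket.objects.filter(Prefix="first-level/"):
    print(obj.key)

# Low-level client API; this is the same client the resource uses internally
client = bucket.meta.client
resp = client.list_objects_v2(Bucket="my-bucket-name", Prefix="first-level/")
for obj in resp.get("Contents", []):
    print(obj["Key"])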
Solution 2:
The piece of code below returns ONLY the ‘subfolders’ in a ‘folder’ of an S3 bucket.
import boto3

bucket = 'my-bucket'
# Make sure you provide / at the end
prefix = 'prefix-name-with-slash/'

client = boto3.client('s3')
result = client.list_objects(Bucket=bucket, Prefix=prefix, Delimiter='/')
for o in result.get('CommonPrefixes'):
    print('sub folder : ', o.get('Prefix'))
For more details, you can refer to https://github.com/boto/boto3/issues/134
Solution 3:
Short answer (a minimal sketch follows this list):
- Use Delimiter='/'. This avoids doing a recursive listing of your bucket. Some answers here wrongly suggest doing a full listing and using some string manipulation to retrieve the directory names. This could be horribly inefficient. Remember that S3 has virtually no limit on the number of objects a bucket can contain. So, imagine that, between bar/ and foo/, you have a trillion objects: you would wait a very long time to get ['bar/', 'foo/'].
- Use Paginators. For the same reason (S3 is an engineer’s approximation of infinity), you must list through pages and avoid storing all the listing in memory. Instead, consider your “lister” as an iterator, and handle the stream it produces.
- Use boto3.client, not boto3.resource. The resource version doesn’t seem to handle the Delimiter option well. If you have a resource, say a bucket = boto3.resource('s3').Bucket(name), you can get the corresponding client with: bucket.meta.client.
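Here is that minimal sketch of the short answer (the bucket and prefix are placeholders; list_objects_v2 is used here, though the list_objects paginator behaves the same way for this purpose): a paginated, non-recursive listing that yields only the “directory” names via CommonPrefixes.

import boto3

client = boto3.client("s3")
paginator = client.get_paginator("list_objects_v2")

# Delimiter='/' makes S3 roll up everything below each sub-"directory"
# into a CommonPrefixes entry instead of enumerating every object.
for page in paginator.paginate(Bucket="my-bucket-name", Prefix="first-level/", Delimiter="/"):
    for cp in page.get("CommonPrefixes", []):
        print(cp["Prefix"])  # e.g. first-level/1456753904534/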
Long answer:
The following is an iterator that I use for simple buckets (no version handling).
import os

import boto3
from collections import namedtuple
from operator import attrgetter

S3Obj = namedtuple('S3Obj', ['key', 'mtime', 'size', 'ETag'])


def s3list(bucket, path, start=None, end=None, recursive=True, list_dirs=True,
           list_objs=True, limit=None):
    """
    Iterator that lists a bucket's objects under path, (optionally) starting with
    start and ending before end.

    If recursive is False, then list only the "depth=0" items (dirs and objects).

    If recursive is True, then list recursively all objects (no dirs).

    Args:
        bucket: a boto3.resource('s3').Bucket().
        path: a directory in the bucket.
        start: optional: start key, inclusive (may be a relative path under path,
            or absolute in the bucket)
        end: optional: stop key, exclusive (may be a relative path under path,
            or absolute in the bucket)
        recursive: optional, default True. If True, lists only objects. If False,
            lists only depth 0 "directories" and objects.
        list_dirs: optional, default True. Has no effect in recursive listing.
            On non-recursive listing, if False, then directories are omitted.
        list_objs: optional, default True. If False, then directories are omitted.
        limit: optional. If specified, then lists at most this many items.

    Returns:
        an iterator of S3Obj.

    Examples:
        # set up
        >>> s3 = boto3.resource('s3')
        ... bucket = s3.Bucket(name)

        # iterate through all S3 objects under some dir
        >>> for p in s3list(bucket, 'some/dir'):
        ...     print(p)

        # iterate through up to 20 S3 objects under some dir, starting with foo_0010
        >>> for p in s3list(bucket, 'some/dir', limit=20, start='foo_0010'):
        ...     print(p)

        # non-recursive listing under some dir:
        >>> for p in s3list(bucket, 'some/dir', recursive=False):
        ...     print(p)

        # non-recursive listing under some dir, listing only dirs:
        >>> for p in s3list(bucket, 'some/dir', recursive=False, list_objs=False):
        ...     print(p)
    """
    kwargs = dict()
    if start is not None:
        if not start.startswith(path):
            start = os.path.join(path, start)
        # note: need to use a string just smaller than start, because
        # the list_object API specifies that start is excluded (the first
        # result is *after* start).
        kwargs.update(Marker=__prev_str(start))
    if end is not None:
        if not end.startswith(path):
            end = os.path.join(path, end)
    if not recursive:
        kwargs.update(Delimiter='/')
        if not path.endswith('/'):
            path += '/'
    kwargs.update(Prefix=path)
    if limit is not None:
        kwargs.update(PaginationConfig={'MaxItems': limit})

    paginator = bucket.meta.client.get_paginator('list_objects')
    for resp in paginator.paginate(Bucket=bucket.name, **kwargs):
        q = []
        if 'CommonPrefixes' in resp and list_dirs:
            q = [S3Obj(f['Prefix'], None, None, None) for f in resp['CommonPrefixes']]
        if 'Contents' in resp and list_objs:
            q += [S3Obj(f['Key'], f['LastModified'], f['Size'], f['ETag']) for f in resp['Contents']]
        # note: even with sorted lists, it is faster to sort(a+b)
        # than heapq.merge(a, b) at least up to 10K elements in each list
        q = sorted(q, key=attrgetter('key'))
        if limit is not None:
            q = q[:limit]
            limit -= len(q)
        for p in q:
            if end is not None and p.key >= end:
                return
            yield p


def __prev_str(s):
    if len(s) == 0:
        return s
    s, c = s[:-1], ord(s[-1])
    if c > 0:
        s += chr(c - 1)
    s += ''.join(['\u7FFF' for _ in range(10)])
    return s
Test:
The following is helpful to test the behavior of the paginator and list_objects. It creates a number of dirs and files. Since the pages are up to 1000 entries, we use a multiple of that for dirs and files. dirs contains only directories (each having one object). mixed contains a mix of dirs and objects, with a ratio of 2 objects for each dir (plus one object under the dir, of course; S3 stores only objects).
import concurrent.futures
import os

# assumes bucket = boto3.resource('s3').Bucket(name) from the setup above

def genkeys(top='tmp/test', n=2000):
    for k in range(n):
        if k % 100 == 0:
            print(k)
        for name in [
            os.path.join(top, 'dirs', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_dir', 'foo'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_a'),
            os.path.join(top, 'mixed', f'{k:04d}_foo_b'),
        ]:
            yield name

with concurrent.futures.ThreadPoolExecutor(max_workers=32) as executor:
    executor.map(lambda name: bucket.put_object(Key=name, Body='hi\n'.encode()), genkeys())
The resulting structure is:
./dirs/0000_dir/foo
./dirs/0001_dir/foo
./dirs/0002_dir/foo
...
./dirs/1999_dir/foo
./mixed/0000_dir/foo
./mixed/0000_foo_a
./mixed/0000_foo_b
./mixed/0001_dir/foo
./mixed/0001_foo_a
./mixed/0001_foo_b
./mixed/0002_dir/foo
./mixed/0002_foo_a
./mixed/0002_foo_b
...
./mixed/1999_dir/foo
./mixed/1999_foo_a
./mixed/1999_foo_b
With a little bit of doctoring of the code given above for s3list to inspect the responses from the paginator, you can observe some fun facts (a small inspection sketch follows this list):
- The Marker is really exclusive. Given Marker=topdir + 'mixed/0500_foo_a', the listing will start after that key (as per the AmazonS3 API), i.e., with .../mixed/0500_foo_b. That’s the reason for __prev_str().
- Using Delimiter, when listing mixed/, each response from the paginator contains 666 keys and 334 common prefixes. It’s pretty good at not building enormous responses.
- By contrast, when listing dirs/, each response from the paginator contains 1000 common prefixes (and no keys).
- Passing a limit in the form of PaginationConfig={'MaxItems': limit} limits only the number of keys, not the common prefixes. We deal with that by further truncating the stream of our iterator.
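For reference, here is a hedged sketch of that kind of inspection without modifying s3list itself: iterate the paginator directly and count keys vs. common prefixes per page (the bucket name and the tmp/test prefix are the placeholders used in the test above).

import boto3

client = boto3.client("s3")
paginator = client.get_paginator("list_objects")

# Count, for each page, how many keys and how many common prefixes come back.
pages = paginator.paginate(Bucket="my-bucket-name", Prefix="tmp/test/mixed/", Delimiter="/")
for page_number, page in enumerate(pages):
    n_keys = len(page.get("Contents", []))
    n_prefixes = len(page.get("CommonPrefixes", []))
    print(f"page {page_number}: {n_keys} keys, {n_prefixes} common prefixes")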
Solution 4:
It took me a lot of time to figure out, but here, finally, is a simple way to list the contents of a subfolder in an S3 bucket using boto3. Hope it helps.
prefix = "folderone/foldertwo/" s3 = boto3.resource('s3') bucket = s3.Bucket(name="bucket_name_here") FilesNotFound = True for obj in bucket.objects.filter(Prefix=prefix): print('{0}:{1}'.format(bucket.name, obj.key)) FilesNotFound = False if FilesNotFound: print("ALERT", "No file in {0}/{1}".format(bucket, prefix))
Solution 5:
The big realisation with S3 is that there are no folders/directories, just keys. The apparent folder structure is just prepended to the filename to become the ‘Key’, so to list the contents of myBucket’s some/path/to/the/file/ you can try:
s3 = boto3.client('s3')
for obj in s3.list_objects_v2(Bucket="myBucket", Prefix="some/path/to/the/file/")['Contents']:
    print(obj['Key'])
which would give you something like:
some/path/to/the/file/yo.jpg
some/path/to/the/file/meAndYou.gif
...
Solution 6:
I had the same issue but managed to resolve it using boto3.client and list_objects_v2 with the Bucket and StartAfter parameters.
import boto3

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for object in theobjects['Contents']:
    print(object['Key'])
The output result for the code above would display the following:
firstlevelFolder/secondLevelFolder/item1
firstlevelFolder/secondLevelFolder/item2
Boto3 list_objects_v2 Documentation
In order to strip out only the directory name for secondLevelFolder, I just used the Python method split():
import boto3

s3client = boto3.client('s3')
bucket = 'my-bucket-name'
startAfter = 'firstlevelFolder/secondLevelFolder'

theobjects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for object in theobjects['Contents']:
    directoryName = object['Key'].split('/')
    print(directoryName[1])
The output result for the code above would display the following:
secondLevelFolder
secondLevelFolder
Python split() Documentation
If you’d like to get the directory name AND the contents item name, then replace the print line with the following:

print("{}/{}".format(directoryName[1], directoryName[2]))
And the following will be output:
secondLevelFolder/item1
secondLevelFolder/item2
Hope this helps
Solution 7:
The following works for me… S3 objects:
s3://bucket/
    form1/
        section11/
            file111
            file112
        section12/
            file121
    form2/
        section21/
            file211
            file112
        section22/
            file221
            file222
            ...
        ...
    ...
Using:
from boto3.session import Session

session = Session()
s3client = session.client('s3')

resp = s3client.list_objects(Bucket=bucket, Prefix='', Delimiter="/")
forms = [x['Prefix'] for x in resp['CommonPrefixes']]
we get:
form1/
form2/
...
With:
resp = s3client.list_objects(Bucket=bucket, Prefix='form1/', Delimiter="/")
sections = [x['Prefix'] for x in resp['CommonPrefixes']]
we get:
form1/section11/
form1/section12/
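If you need every second-level prefix in one go (which is what the original question asks for), the two calls above can be chained. This is only a sketch under the same assumptions as the snippets above: bucket holds the bucket name, and a very large bucket would additionally need the paginator rather than plain list_objects.

from boto3.session import Session

session = Session()
s3client = session.client('s3')

second_level = []
top = s3client.list_objects(Bucket=bucket, Prefix='', Delimiter='/')
for form in top.get('CommonPrefixes', []):
    # List one level deeper under each first-level prefix, e.g. 'form1/'.
    resp = s3client.list_objects(Bucket=bucket, Prefix=form['Prefix'], Delimiter='/')
    second_level.extend(x['Prefix'] for x in resp.get('CommonPrefixes', []))

print(second_level)  # e.g. ['form1/section11/', 'form1/section12/', 'form2/section21/', ...]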
Solution 8:
The AWS CLI does this (presumably without fetching and iterating through all keys in the bucket) when you run aws s3 ls s3://my-bucket/, so I figured there must be a way using boto3.
https://github.com/aws/aws-cli/blob/0fedc4c1b6a7aee13e2ed10c3ada778c702c22c3/awscli/customizations/s3/subcommands.py#L499
It looks like they indeed use Prefix and Delimiter – I was able to write a function that would get me all directories at the root level of a bucket by modifying that code a bit:
import boto3

def list_folders_in_bucket(bucket):
    paginator = boto3.client('s3').get_paginator('list_objects')
    folders = []
    iterator = paginator.paginate(Bucket=bucket, Prefix='', Delimiter='/',
                                  PaginationConfig={'PageSize': None})
    for response_data in iterator:
        prefixes = response_data.get('CommonPrefixes', [])
        for prefix in prefixes:
            prefix_name = prefix['Prefix']
            if prefix_name.endswith('/'):
                folders.append(prefix_name.rstrip('/'))
    return folders
Solution 9:
Here is a possible solution:
def download_list_s3_folder(my_bucket, my_folder):
    import boto3
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(
        Bucket=my_bucket,
        Prefix=my_folder,
        MaxKeys=1000)
    return [item["Key"] for item in response['Contents']]
Solution 10:
Using boto3.resource
This builds upon the answer by itz-azhar to apply an optional limit. It is obviously substantially simpler to use than the boto3.client version.
import logging
from typing import List, Optional

import boto3
from boto3_type_annotations.s3 import ObjectSummary  # pip install boto3_type_annotations

log = logging.getLogger(__name__)
_S3_RESOURCE = boto3.resource("s3")


def s3_list(bucket_name: str, prefix: str, *, limit: Optional[int] = None) -> List[ObjectSummary]:
    """Return a list of S3 object summaries."""
    # Ref: https://stackoverflow.com/a/57718002/
    return list(_S3_RESOURCE.Bucket(bucket_name).objects.limit(count=limit).filter(Prefix=prefix))


if __name__ == "__main__":
    s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)
Using boto3.client
This uses list_objects_v2 and builds upon the answer by CpILL to allow retrieving more than 1000 objects.
import logging
from typing import cast, List

import boto3

log = logging.getLogger(__name__)
_S3_CLIENT = boto3.client("s3")


def s3_list(bucket_name: str, prefix: str, *, limit: int = cast(int, float("inf"))) -> List[dict]:
    """Return a list of S3 object summaries."""
    # Ref: https://stackoverflow.com/a/57718002/
    contents: List[dict] = []
    continuation_token = None
    if limit <= 0:
        return contents
    while True:
        max_keys = min(1000, limit - len(contents))
        request_kwargs = {"Bucket": bucket_name, "Prefix": prefix, "MaxKeys": max_keys}
        if continuation_token:
            log.info(  # type: ignore
                "Listing %s objects in s3://%s/%s using continuation token ending with %s "
                "with %s objects listed thus far.",
                max_keys, bucket_name, prefix,
                continuation_token[-6:], len(contents))  # pylint: disable=unsubscriptable-object
            response = _S3_CLIENT.list_objects_v2(**request_kwargs, ContinuationToken=continuation_token)
        else:
            log.info("Listing %s objects in s3://%s/%s with %s objects listed thus far.",
                     max_keys, bucket_name, prefix, len(contents))
            response = _S3_CLIENT.list_objects_v2(**request_kwargs)
        assert response["ResponseMetadata"]["HTTPStatusCode"] == 200
        contents.extend(response["Contents"])
        is_truncated = response["IsTruncated"]
        if (not is_truncated) or (len(contents) >= limit):
            break
        continuation_token = response["NextContinuationToken"]
    assert len(contents) <= limit
    log.info("Returning %s objects from s3://%s/%s.", len(contents), bucket_name, prefix)
    return contents


if __name__ == "__main__":
    s3_list("noaa-gefs-pds", "gefs.20190828/12/pgrb2a", limit=10_000)