Google Takeout Data Structure
This page describes the Google Takeout data structure.
Google+ Stream
Posts
Filename
For each Activity resource (the technical term the API reference uses for posts), a separate JSON file is created. Its filename appears to follow this structure:
YYYYMMDD - UNIQUE_POST_TITLE.json
or, represented as a Regular Expression:
(?<year>[0-9]{4})(?<month>[0-9]{2})(?<day>[0-9]{2}) - (?<unique_post_title>.{0,42})\.json
where YYYYMMDD is the Activity's creation date, consisting of a 4-digit year, 2-digit month and 2-digit day of the month, and UNIQUE_POST_TITLE is an Activity identifier of at most 42 characters that is unique within the scope of the same day.
The UNIQUE_POST_TITLE is generally the first 39 characters of the Activity's content.
However, if this would cause a duplicate filename within the same day, the last characters of this fragment are replaced with a parenthesised numeric suffix matching \(\d+\) (e.g. (1)). The number between the parentheses is an integer that starts at 1 and is incremented by 1 until the suffixed title fragment is unique again within the scope of the day.
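The naming scheme above can be checked with a small parser. This is a minimal sketch rather than a verified implementation: the function name parse_post_filename is made up for illustration, and it assumes the de-duplication suffix always sits at the very end of the title fragment.

```python
import re
from datetime import date

# Python spells named groups (?P<name>...) rather than (?<name>...).
FILENAME_RE = re.compile(
    r"(?P<year>[0-9]{4})(?P<month>[0-9]{2})(?P<day>[0-9]{2})"
    r" - (?P<unique_post_title>.{0,42})\.json"
)
# Assumption: the duplicate marker looks like "(1)", "(2)", ... at the end.
SUFFIX_RE = re.compile(r"\((?P<n>\d+)\)$")


def parse_post_filename(filename):
    """Return (creation_date, title_fragment, duplicate_index) or None."""
    match = FILENAME_RE.fullmatch(filename)
    if match is None:
        return None
    try:
        created = date(int(match["year"]), int(match["month"]), int(match["day"]))
    except ValueError:
        # Eight digits that do not form a real calendar date.
        return None
    title = match["unique_post_title"]
    suffix = SUFFIX_RE.search(title)
    # 0 means "no duplicate suffix found".
    index = int(suffix["n"]) if suffix else 0
    return created, title, index
```

Note that a post whose own text happens to end in something like (2) would be reported as a duplicate by this sketch; the filename alone does not allow telling the two cases apart.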
Data
The data is a standard JSON file. Its structure can be visualised as a Hash or associative array as follows:
{
  'url': 'https://plus.google.com/ACTIVITY_USER_ID/posts/ACTIVITY_ID',
  'creationTime': 'YYYY-mm-dd HH:MM:SS:zzzz',
  'updateTime': 'YYYY-mm-dd HH:MM:SS:zzzz',
  'author': {
    'displayName': 'DISPLAY_NAME_WITHOUT_NICKNAME',
    'profilePageUrl': 'https://plus.google.com/ACTIVITY_USER_ID',
    'avatarImageUrl': 'https://lh3.googleusercontent.com/-RANDOM/PATH/TO/RESOURCE/photo.jpg',
    'resourceName': 'users/ACTIVITY_USER_ID'
  },
  'album': {
    'media': [
      {
        'url': 'https://lh3.googleusercontent.com/-RANDOM/PATH/TO/RESOURCE/photo.jpg',
        'contentType': 'image/*',
        'width': 1234,
        'height': 4321,
        'resourceName': 'media/MEDIA_RESOURCE_ID'
      }
    ]
  },
  'content': 'HTML_FORMATTED_CONTENT',
  'link': {
    'title': 'LINK_TITLE',
    'url': 'LINK_ABSOLUTE_URL',
    'imageUrl': 'LINK_ABSOLUTE_IMAGE_URL'
  },
  'comments': [
    {
      'creationTime': 'YYYY-mm-dd HH:MM:SS:zzzz',
      'author': {
        'displayName': 'DISPLAY_NAME_WITHOUT_NICKNAME',
        'profilePageUrl': 'https://plus.google.com/USER_ID',
        'avatarImageUrl': 'https://lh3.googleusercontent.com/-RANDOM/PATH/TO/RESOURCE/photo.jpg',
        'resourceName': 'users/USER_ID'
      },
      'content': 'HTML_FORMATTED_CONTENT',
      'postUrl': 'https://plus.google.com/ACTIVITY_USER_ID/posts/ACTIVITY_ID',
      'resourceName': 'users/ACTIVITY_USER_ID/posts/THREAD_ID/comments/COMMENT_ID'
    }
  ],
  'resourceName': 'users/ACTIVITY_USER_ID/posts/THREAD_ID',
  'plusOnes': [
    {
      'plusOner': {
        'displayName': 'DISPLAY_NAME_WITHOUT_NICKNAME',
        'profilePageUrl': 'https://plus.google.com/USER_ID',
        'avatarImageUrl': 'https://lh3.googleusercontent.com/-RANDOM/PATH/TO/RESOURCE/photo.jpg',
        'resourceName': 'users/USER_ID'
      }
    }
  ],
  'postAcl': {
    'visibleToStandardAcl': {
      'circles': [
        { 'type': 'CIRCLE_TYPE_PUBLIC' },
        { 'type': 'CIRCLE_TYPE_YOUR_CIRCLES' },
        { 'type': 'CIRCLE_TYPE_EXTENDED_CIRCLES' },
        {
          'resourceName': 'circles/USER_ID-CIRCLE_ID',
          'type': 'CIRCLE_TYPE_USER_CIRCLE',
          'displayName': 'CIRCLE_DISPLAY_NAME'
        }
      ],
      'users': [
        {
          'resourceName': 'users/USER_ID',
          'displayName': 'DISPLAY_NAME_WITHOUT_NICKNAME'
        }
      ]
    },
    'communityAcl': {
      'community': {
        'resourceName': 'communities/COMMUNITY_ID',
        'displayName': 'COMMUNITY_DISPLAY_NAME'
      },
      'users': [
        {
          'resourceName': 'users/USER_ID',
          'displayName': 'DISPLAY_NAME_WITHOUT_NICKNAME'
        }
      ]
    },
    'eventAcl': {
      'event': {
        'resourceName': 'events/EVENT_ID'
      }
    },
    'collectionAcl': {
      'collection': {
        'resourceName': 'collections/COLLECTION_ID',
        'displayName': 'COLLECTION_DISPLAY_NAME'
      }
    }
  }
}
Note that this structure may still be incomplete, as it's currently only based on analysis of a few of the JSON files, from just a single user's takeout.
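As an illustration of working with this structure, the following sketch loads one post file and pulls out a few headline figures. The helper names load_post and summarise_post are hypothetical, and the code assumes that any of the top-level keys may be absent, which seems prudent given how incomplete the analysis still is.

```python
import json


def load_post(path):
    """Read a single Takeout post file into a Python dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarise_post(post):
    """Collect a few fields, treating every key as optional."""
    author = post.get("author", {})
    return {
        "url": post.get("url"),
        "author": author.get("displayName"),
        "comments": len(post.get("comments", [])),
        "plus_ones": len(post.get("plusOnes", [])),
        "album_media": len(post.get("album", {}).get("media", [])),
    }
```

Using .get() with a default throughout means a post without comments or an album yields zero counts instead of raising a KeyError.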
Legend
- ACTIVITY_USER_ID is the numeric Person ID of the user who posts the Activity;
- ACTIVITY_ID is a unique identifier consisting of alphanumeric characters ([a-zA-Z0-9]);
- USER_ID is the numeric Person ID of the user within the scope of the current content. For instance, within the scope of 'comments', it is the Person ID of the user who posted the comment;
- DISPLAY_NAME_WITHOUT_NICKNAME is the display name of the profile; however, the nickname does not seem to be included, even if the user had set their nickname to be included within their display name.
- HTML_FORMATTED_CONTENT is the content (the post's or comment's body) formatted using HTML, rather than containing Google's own markdown-like markup/formatting language.
- LINK_TITLE is the title of the linked (external) webpage, likely extracted from the webpage's <title> HTML tag.
- LINK_ABSOLUTE_URL is the absolute URL of the linked (external) webpage (e.g. https://somesite.example/path/to/some/page).
- LINK_ABSOLUTE_IMAGE_URL is the absolute URL of the linked (external) image resource (e.g. https://somesite.example/path/to/some/image.jpg).
- THREAD_ID appears to be a unique identifier for an Activity's comment thread. It also seems to act as a secondary Activity ID (perhaps used internally?), as it is used not only within comment threads but also in the Activity's resourceName.
- COMMENT_ID appears to be a unique identifier for a comment within an Activity's comment thread. It appears to consist of two sub-fragments, delimited by a hyphen (-). The first fragment appears to be the same for all comments within the same thread, and the last (shortest) fragment seems to be unique within that thread.
- COMMUNITY_ID is the unique numeric identifier for the community in which the Activity was posted.
- COMMUNITY_DISPLAY_NAME is a string, the display name for the community in which the Activity was posted.
- EVENT_ID is the unique alphanumeric identifier for the Event Activity that was posted.
- MEDIA_RESOURCE_ID is an alphanumeric URL-encoded identifier, or possibly a URL-encoded BASE64 content string.
- COLLECTION_ID is an alphanumeric identifier for the Collection the Activity is in.
- COLLECTION_DISPLAY_NAME is the display name of the Collection.
- media.width and media.height are *integer* values representing the media item's dimensions in pixels.
Within datetime strings:
- YYYY is a 4-digit year,
- mm a 2-digit month (01-12),
- dd a 2-digit day of the month (01-31),
- HH a 2-digit hour (00-23),
- MM a 2-digit minute of the hour (00-59),
- SS 2 digits indicating the seconds of the minute (00-59) and
- zzzz 2x2 digits indicating the hours and minutes offset from UTC.
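A datetime string of this shape could be turned into a timezone-aware value roughly as follows. This is a sketch under assumptions: the function name parse_takeout_timestamp is made up, and because the exact separator and sign in front of zzzz are not confirmed here, the pattern tolerates an optional colon and an optional + or - (a missing sign is read as a positive offset).

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: the offset may be preceded by ":" and/or an explicit sign.
TIMESTAMP_RE = re.compile(
    r"(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})"
    r":?([+-]?)(\d{2})(\d{2})"
)


def parse_takeout_timestamp(text):
    """Parse 'YYYY-mm-dd HH:MM:SS:zzzz' into an aware datetime."""
    match = TIMESTAMP_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {text!r}")
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    # An absent sign is treated as "+" (assumption).
    sign = -1 if match.group(7) == "-" else 1
    offset = sign * timedelta(hours=int(match.group(8)), minutes=int(match.group(9)))
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))
```

Calling .astimezone(timezone.utc) on the result normalises timestamps from different offsets so they can be compared or sorted.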
For the postAcl (Access Control List) data, there is some exclusivity regarding the items, but for the sake of completeness all *possible* items are listed.
It's likely that you can have either a visibleToStandardAcl item, an eventAcl item, a communityAcl item, or a collectionAcl item, and that none of them are combined with each other.
Furthermore, within visibleToStandardAcl.circles, the CIRCLE_TYPE_PUBLIC, CIRCLE_TYPE_YOUR_CIRCLES and CIRCLE_TYPE_EXTENDED_CIRCLES circle items are likely mutually exclusive, which means you can't combine them with each other. You can likely find one of them combined with zero or more CIRCLE_TYPE_USER_CIRCLE circle items, though.
Finally, within communityAcl the users item is optional.
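Since these exclusivity rules are only suspected, it can be useful to test them against an archive. The sketch below reports which ACL items a postAcl value contains and whether it is consistent with the rules described above; describe_acl and the constant names are hypothetical.

```python
# The four ACL items that are suspected to be mutually exclusive.
ACL_KINDS = ("visibleToStandardAcl", "communityAcl", "eventAcl", "collectionAcl")
# Circle types that are suspected not to occur together.
BROAD_CIRCLE_TYPES = {
    "CIRCLE_TYPE_PUBLIC",
    "CIRCLE_TYPE_YOUR_CIRCLES",
    "CIRCLE_TYPE_EXTENDED_CIRCLES",
}


def describe_acl(post_acl):
    """Report the ACL items present and whether the suspected rules hold."""
    kinds = [kind for kind in ACL_KINDS if kind in post_acl]
    circles = post_acl.get("visibleToStandardAcl", {}).get("circles", [])
    broad = [c["type"] for c in circles if c.get("type") in BROAD_CIRCLE_TYPES]
    return {
        "kinds": kinds,
        "broad_circle_types": broad,
        # True when at most one ACL item and at most one broad circle type occur.
        "matches_suspected_rules": len(kinds) <= 1 and len(broad) <= 1,
    }
```

Running this over every post and collecting the cases where matches_suspected_rules is False would either support the rules or surface counterexamples.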
Complete Flat JSON Structure
What follows is a list of all possible JSON keys encountered after analysing a complete archive of 2145 Google+ Stream posts from 2011-06-30 till 2018-10-23:
album
album.media
album.media[]
album.media[].contentType
album.media[].description
album.media[].height
album.media[].resourceName
album.media[].url
album.media[].width
author
author.avatarImageUrl
author.displayName
author.profilePageUrl
author.resourceName
comments
comments[]
comments[].author
comments[].author.avatarImageUrl
comments[].author.displayName
comments[].author.profilePageUrl
comments[].author.resourceName
comments[].content
comments[].creationTime
comments[].link
comments[].link.imageUrl
comments[].link.title
comments[].link.url
comments[].media
comments[].media.contentType
comments[].media.height
comments[].media.resourceName
comments[].media.url
comments[].media.width
comments[].postUrl
comments[].resourceName
comments[].updateTime
communityAttachment
communityAttachment.coverPhotoUrl
communityAttachment.displayName
communityAttachment.resourceName
content
creationTime
link
link.imageUrl
link.title
link.url
location
location.displayName
location.latitude
location.longitude
location.physicalAddress
media
media.contentType
media.description
media.height
media.resourceName
media.url
media.width
plusOnes
plusOnes[]
plusOnes[].plusOner
plusOnes[].plusOner.avatarImageUrl
plusOnes[].plusOner.displayName
plusOnes[].plusOner.profilePageUrl
plusOnes[].plusOner.resourceName
postAcl
postAcl.communityAcl
postAcl.communityAcl.community
postAcl.communityAcl.community.displayName
postAcl.communityAcl.community.resourceName
postAcl.communityAcl.users
postAcl.communityAcl.users[]
postAcl.communityAcl.users[].displayName
postAcl.communityAcl.users[].resourceName
postAcl.eventAcl
postAcl.eventAcl.event
postAcl.eventAcl.event.resourceName
postAcl.isLegacyAcl
postAcl.visibleToStandardAcl
postAcl.visibleToStandardAcl.circles
postAcl.visibleToStandardAcl.circles[]
postAcl.visibleToStandardAcl.circles[].displayName
postAcl.visibleToStandardAcl.circles[].resourceName
postAcl.visibleToStandardAcl.circles[].type
postAcl.visibleToStandardAcl.users
postAcl.visibleToStandardAcl.users[]
postAcl.visibleToStandardAcl.users[].displayName
postAcl.visibleToStandardAcl.users[].resourceName
resharedPost
resharedPost.album
resharedPost.album.media
resharedPost.album.media[]
resharedPost.album.media[].contentType
resharedPost.album.media[].description
resharedPost.album.media[].height
resharedPost.album.media[].resourceName
resharedPost.album.media[].url
resharedPost.album.media[].width
resharedPost.author
resharedPost.author.avatarImageUrl
resharedPost.author.displayName
resharedPost.author.profilePageUrl
resharedPost.author.resourceName
resharedPost.content
resharedPost.link
resharedPost.link.imageUrl
resharedPost.link.title
resharedPost.link.url
resharedPost.media
resharedPost.media.contentType
resharedPost.media.description
resharedPost.media.height
resharedPost.media.resourceName
resharedPost.media.url
resharedPost.media.width
resharedPost.resourceName
resharedPost.url
reshares
reshares[]
reshares[].resharer
reshares[].resharer.avatarImageUrl
reshares[].resharer.displayName
reshares[].resharer.profilePageUrl
reshares[].resharer.resourceName
resourceName
updateTime
url
This list was generated with the following one-liner:
jq -s 'map(.)' takeout_archive_2018/*.json | jq -r 'paths|map(.|tostring)|join(".")' |gsed -r 's/^[0-9]+(\.|$)//'|gsed -r 's/\.[0-9]+/[]/g'|sort -u > unique_keys.txt
(I'm using gsed to indicate that it's GNU sed rather than, for instance, the default sed that comes with macOS; the regular expressions are probably simple enough to work regardless.)
A . (period) indicates that the following key is a child, and [] (square brackets) indicate that the preceding parent key is an array.
So, 'reshares' is an array whose members contain the hash key 'resharer', which in turn contains the keys 'avatarImageUrl', 'displayName', 'profilePageUrl' and 'resourceName'.
In JSON this would be equivalent to:
"reshares": [
  {
    "resharer": {
      "avatarImageUrl": "some url",
      "displayName": "some display name",
      "profilePageUrl": "some url",
      "resourceName": "some/resource/path"
    }
  }
]
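For readers without jq or GNU sed at hand, the same flat notation can be produced in Python. This is a rough equivalent rather than a port of the one-liner: key_paths and unique_keys are hypothetical names, and the resulting sort order may differ from what sort -u gives under some locales.

```python
def key_paths(value, prefix=""):
    """Yield flat key paths: '.' descends into an object, '[]' marks an array."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from key_paths(child, path)
    elif isinstance(value, list):
        path = f"{prefix}[]"
        for child in value:
            # Emit the array-member path itself, then whatever lies beneath it.
            yield path
            yield from key_paths(child, path)


def unique_keys(posts):
    """Return the sorted set of key paths across all parsed posts."""
    return sorted({path for post in posts for path in key_paths(post)})
```

Here posts is expected to be an iterable of already parsed JSON objects, for instance one json.load() result per file in the archive directory.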