Google Takeout Data Structure

From PlexodusWiki

This page describes the Google Takeout data structure.

Google+ Stream

Posts

Filename

For each Activity resource (the technical term the API reference uses for posts), a separate JSON file is created. Its filename appears to follow this structure:

 YYYYMMDD - UNIQUE_POST_TITLE.json

or, represented as a Regular Expression:

 (?<year>[0-9]{4})(?<month>[0-9]{2})(?<day>[0-9]{2}) - (?<unique_post_title>.{0,42})\.json

where YYYYMMDD is the Activity's creation date consisting of a 4 digit year, 2 digit month and 2 digit day of month, and UNIQUE_POST_TITLE is a unique (within the scope of the same day) Activity identifier that's at most 42 characters long.

The UNIQUE_POST_TITLE is generally the first 39 characters of the Activity's content. However, if this would cause a duplicate filename within the same day, the last characters of this fragment are replaced with a (\d+) suffix (e.g. (1)), where the number between the parentheses is an integer that starts at 1 and is incremented by 1 until the suffixed title fragment is unique again within the scope of the day.
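The filename structure above can be checked with a short Python sketch. The function name parse_post_filename and the sample filename are made up for illustration, and the 42-character upper bound is taken from the description above rather than verified independently.

```python
import re
from datetime import date

# Pattern mirroring the filename structure described above. Python spells
# named groups as (?P<name>...) instead of (?<name>...).
FILENAME_RE = re.compile(
    r"^(?P<year>[0-9]{4})(?P<month>[0-9]{2})(?P<day>[0-9]{2})"
    r" - (?P<unique_post_title>.{0,42})\.json$"
)


def parse_post_filename(filename):
    """Return (creation_date, unique_post_title), or None if the name does not fit."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    try:
        created = date(int(match["year"]), int(match["month"]), int(match["day"]))
    except ValueError:
        # Eight digits that do not form a real calendar date (e.g. month 13).
        return None
    return created, match["unique_post_title"]


# Hypothetical filename, shaped like the structure above.
print(parse_post_filename("20181023 - Hello world.json"))
```

Files that return None are worth inspecting by hand, since they may point to a naming variant this page has not documented yet.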

Data

The data is a standard JSON file. Its structure can be visualised as a Hash or associative array as follows:

 
  {
    'url':          'https://plus.google.com/ACTIVITY_USER_ID/posts/ACTIVITY_ID',
    'creationTime': 'YYYY-mm-dd HH:MM:SS:zzzz',
    'updateTime':   'YYYY-mm-dd HH:MM:SS:zzzz',
    'author':
    {
      'displayName':    'DISPLAY_NAME_WITHOUT_NICKNAME',
      'profilePageUrl': 'https://plus.google.com/ACTIVITY_USER_ID',
      'avatarImageUrl':  'https://lh3.googleusercontent.com/-RANDOM/PATH/TO/RESOURCE/photo.jpg',
      'resourceName':    'users/ACTIVITY_USER_ID'
    },
    'album':
    {
      'media':
      [
        {
          'url':          'https://lh3.googleusercontent.com/-RANDOM/PATH/TO/RESOURCE/photo.jpg',
          'contentType':   'image/*',
          'width':        1234,
          'height':       4321,
          'resourceName': 'media/MEDIA_RESOURCE_ID'
        }
      ]
    },
    'content':          'HTML_FORMATTED_CONTENT',
    'link':
    {
      'title':   'LINK_TITLE',
      'url':     'LINK_ABSOLUTE_URL',
      'imageUrl': 'LINK_ABSOLUTE_IMAGE_URL'
    },
    'comments':
    [
      {
        'creationTime': 'YYYY-mm-dd HH:MM:SS:zzzz',
        'author':
        {
          'displayName':    'DISPLAY_NAME_WITHOUT_NICKNAME',
          'profilePageUrl': 'https://plus.google.com/USER_ID',
          'avatarImageUrl': 'https://lh3.googleusercontent.com/-RANDOM/PATH/TO/RESOURCE/photo.jpg',
          'resourceName':   'users/USER_ID'
        },
        'content':      'HTML_FORMATTED_CONTENT',
        'postUrl':      'https://plus.google.com/ACTIVITY_USER_ID/posts/ACTIVITY_ID',
        'resourceName': 'users/ACTIVITY_USER_ID/posts/THREAD_ID/comments/COMMENT_ID'
      }
    ],
    'resourceName':    'users/ACTIVITY_USER_ID/posts/THREAD_ID',
    'plusOnes':
    [
      {
        'plusOner':
        {
          'displayName':    'DISPLAY_NAME_WITHOUT_NICKNAME',
          'profilePageUrl': 'https://plus.google.com/USER_ID',
          'avatarImageUrl': 'https://lh3.googleusercontent.com/-RANDOM/PATH/TO/RESOURCE/photo.jpg',
          'resourceName':   'users/USER_ID'
        }
      }
    ],
    'postAcl':
    {
      'visibleToStandardAcl':
      {
        'circles':
        [
          {
            'type': 'CIRCLE_TYPE_PUBLIC'
          },
          {
            'type': 'CIRCLE_TYPE_YOUR_CIRCLES'
          },
          {
            'type': 'CIRCLE_TYPE_EXTENDED_CIRCLES'
          },
          {
            'resourceName': 'circles/USER_ID-CIRCLE_ID',
            'type': 'CIRCLE_TYPE_USER_CIRCLE',
            'displayName': 'CIRCLE_DISPLAY_NAME'
          }
        ],
        'users':
        [
          {
            'resourceName': 'users/USER_ID',
            'displayName':  'DISPLAY_NAME_WITHOUT_NICKNAME'
          }
        ]
      },
      'communityAcl':
      {
        'community':
        {
          'resourceName': 'communities/COMMUNITY_ID',
          'displayName':  'COMMUNITY_DISPLAY_NAME'
        },
        'users':
        [
          {
            'resourceName': 'users/USER_ID',
            'displayName':  'DISPLAY_NAME_WITHOUT_NICKNAME'
          }
        ]
      },
      'eventAcl':
      {
        'event':
        {
          'resourceName': 'events/EVENT_ID'
        }
      },
      'collectionAcl':
      {
        'collection':
        {
          'resourceName': 'collections/COLLECTION_ID',
          'displayName':  'COLLECTION_DISPLAY_NAME'
        }
      }
    }
  }
  

Note that this structure may still be incomplete, as it's currently only based on analysis of a few of the JSON files, from just a single user's takeout.
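Since the structure may be incomplete, code that reads these files should probably treat every key as optional. The following Python sketch illustrates that approach; the function name summarize_post and the sample values are invented for illustration and do not come from a real takeout.

```python
import json


def summarize_post(post):
    """Collect a few headline facts from one decoded post dictionary."""
    album_media = post.get("album", {}).get("media", [])
    return {
        "url": post.get("url"),
        "author": post.get("author", {}).get("displayName"),
        "comment_count": len(post.get("comments", [])),
        "plus_one_count": len(post.get("plusOnes", [])),
        # Skip media entries that lack a URL instead of failing on them.
        "media_urls": [item["url"] for item in album_media if "url" in item],
    }


# Hypothetical sample, shaped like the structure above.
sample = json.loads("""
{
  "url": "https://plus.google.com/123/posts/abc",
  "author": {"displayName": "Example User"},
  "album": {"media": [{"url": "https://example.invalid/photo.jpg",
                       "contentType": "image/*"}]},
  "comments": [{"content": "First!"}],
  "plusOnes": []
}
""")
summary = summarize_post(sample)
print(summary)
```

Using dict.get with a default keeps the sketch working for posts that have no album, comments or +1s.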

Legend

  • ACTIVITY_USER_ID is the numeric Person ID of the user who posted the Activity.
  • ACTIVITY_ID is a unique identifier consisting of alphanumeric characters ([a-zA-Z0-9]).
  • USER_ID is the numeric Person ID of the user within the scope of the current content. For instance, within the scope of 'comments', it is the Person ID of the user who posted the comment.
  • DISPLAY_NAME_WITHOUT_NICKNAME is the display name of the profile. The nickname does not seem to be included, even if the user had set their nickname to be shown within their display name.
  • HTML_FORMATTED_CONTENT is the content (the body of the post or comment) formatted using HTML, rather than Google's own Markdown-like markup language.
  • LINK_TITLE is the title of the linked (external) webpage, likely extracted from the webpage's <title> HTML tag.
  • LINK_ABSOLUTE_URL is the absolute URL of the linked (external) webpage (e.g. https://somesite.example/path/to/some/page).
  • LINK_ABSOLUTE_IMAGE_URL is the absolute URL of the linked (external) image resource (e.g. https://somesite.example/path/to/some/image.jpg).
  • THREAD_ID appears to be a unique identifier for an Activity's comment thread. It also seems to serve as a secondary Activity ID (perhaps used internally?), since it is used not only within comment threads but also in the Activity's resourceName.
  • COMMENT_ID appears to be a unique identifier for a comment within an Activity's comment thread. It appears to consist of two sub-fragments, delimited by a hyphen (-). The first fragment appears to be the same for all comments within the same thread, and the last (shortest) fragment seems to be unique within that thread.
  • COMMUNITY_ID is the unique numeric identifier of the community in which the Activity was posted.
  • COMMUNITY_DISPLAY_NAME is a string containing the display name of the community in which the Activity was posted.
  • EVENT_ID is the unique alphanumeric identifier of the Event Activity that was posted.
  • MEDIA_RESOURCE_ID is an alphanumeric, URL-encoded identifier, or possibly a URL-encoded Base64 content string.
  • COLLECTION_ID is an alphanumeric identifier of the Collection the Activity is in.
  • COLLECTION_DISPLAY_NAME is the display name of the Collection.
  • media.width and media.height are *integer* values giving the media item's dimensions in pixels.
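The resourceName values in the structure above all look like alternating kind/identifier segments separated by slashes. Assuming that pattern holds (it is inferred from the examples on this page, not from any specification), a Python sketch for splitting them could look as follows; the function name parse_resource_name is hypothetical.

```python
def parse_resource_name(resource_name):
    """Split 'kind/id/kind/id/...' into a {kind: id} dictionary.

    Assumes identifiers never contain an unescaped slash, which seems
    plausible if MEDIA_RESOURCE_ID really is URL-encoded.
    """
    parts = resource_name.split("/")
    if len(parts) % 2 != 0 or "" in parts:
        raise ValueError(f"unexpected resourceName shape: {resource_name!r}")
    # Even positions hold the kinds, odd positions the identifiers.
    return dict(zip(parts[0::2], parts[1::2]))


# Hypothetical identifiers, shaped like the placeholders in the structure above.
print(parse_resource_name("users/123/posts/abc/comments/abc-9"))
print(parse_resource_name("communities/456"))
```

For a comment, the resulting dictionary exposes the ACTIVITY_USER_ID, THREAD_ID and COMMENT_ID under the keys 'users', 'posts' and 'comments'.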

Within datetime strings:

  • YYYY is a 4 digit year,
  • mm a 2 digit month (01-12),
  • dd a 2 digit day of the month (01-31),
  • HH a 2 digit hour (00-23),
  • MM a 2 digit minute of the hour (00-59),
  • SS 2 digits indicating the seconds of the minute (00-59) and
  • zzzz 2x2 digits indicating the hours and minutes offset from UTC.
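The components listed above can be turned into a timezone-aware value with a short Python sketch. The placeholder only shows a ':' in front of zzzz, so how the sign of the offset is written is unclear; the sketch below assumes the separator may be '+', '-', ':' or absent, and treats anything other than '-' as a positive offset. That assumption, and the function name parse_takeout_timestamp, should be checked against real files.

```python
import re
from datetime import datetime, timedelta, timezone

# One group per component from the list above, plus an optional separator
# in front of the four offset digits (an assumption, see the lead-in).
TIMESTAMP_RE = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})([+\-:]?)(\d{2})(\d{2})$"
)


def parse_takeout_timestamp(value):
    """Parse a 'YYYY-mm-dd HH:MM:SS' + offset string into an aware datetime."""
    match = TIMESTAMP_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {value!r}")
    year, month, day, hour, minute, second, sep, off_h, off_m = match.groups()
    offset = timedelta(hours=int(off_h), minutes=int(off_m))
    if sep == "-":
        offset = -offset
    return datetime(
        int(year), int(month), int(day),
        int(hour), int(minute), int(second),
        tzinfo=timezone(offset),
    )


# Hypothetical value, shaped like the placeholder in the structure above.
print(parse_takeout_timestamp("2018-10-23 12:34:56+0200").isoformat())
```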

For the postAcl (Access Control List) data, some of the items exclude each other, but for the sake of completeness all *possible* items are listed.

It's likely that a post has exactly one of a visibleToStandardAcl item, an eventAcl item, a communityAcl item, or a collectionAcl item, and never a combination of them.

Furthermore, within visibleToStandardAcl.circles, the CIRCLE_TYPE_PUBLIC, CIRCLE_TYPE_YOUR_CIRCLES and CIRCLE_TYPE_EXTENDED_CIRCLES circle items are likely mutually exclusive: at most one of them appears in a given post. That one item can likely be combined with zero or more CIRCLE_TYPE_USER_CIRCLE items, though.

Finally, within communityAcl the users item is optional.
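Because these exclusivity rules are only suspected, it may be useful to test them against an actual archive. The Python sketch below reports where a postAcl value departs from the rules as described; the function name check_post_acl and the sample values are invented for illustration.

```python
ACL_KEYS = ("visibleToStandardAcl", "communityAcl", "eventAcl", "collectionAcl")
BROAD_CIRCLE_TYPES = {
    "CIRCLE_TYPE_PUBLIC",
    "CIRCLE_TYPE_YOUR_CIRCLES",
    "CIRCLE_TYPE_EXTENDED_CIRCLES",
}


def check_post_acl(post_acl):
    """Return a list of warnings where post_acl breaks the suspected rules."""
    warnings = []
    present = [key for key in ACL_KEYS if key in post_acl]
    if len(present) > 1:
        warnings.append("multiple ACL kinds: " + ", ".join(present))
    circles = post_acl.get("visibleToStandardAcl", {}).get("circles", [])
    broad = [c["type"] for c in circles if c.get("type") in BROAD_CIRCLE_TYPES]
    if len(broad) > 1:
        warnings.append("multiple broad circle types: " + ", ".join(broad))
    return warnings


# Hypothetical sample that follows the rules: one broad circle plus a user circle.
sample_acl = {
    "visibleToStandardAcl": {
        "circles": [
            {"type": "CIRCLE_TYPE_PUBLIC"},
            {"type": "CIRCLE_TYPE_USER_CIRCLE", "displayName": "Friends"},
        ]
    }
}
print(check_post_acl(sample_acl))
```

An empty list means the post is consistent with the rules above; any warning is a counterexample worth recording on this page.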

Complete Flat JSON Structure

What follows is a list of all possible JSON keys encountered after analysing a complete archive of 2145 Google+ Stream posts from 2011-06-30 till 2018-10-23:

 album
 album.media
 album.media[]
 album.media[].contentType
 album.media[].description
 album.media[].height
 album.media[].resourceName
 album.media[].url
 album.media[].width
 author
 author.avatarImageUrl
 author.displayName
 author.profilePageUrl
 author.resourceName
 comments
 comments[]
 comments[].author
 comments[].author.avatarImageUrl
 comments[].author.displayName
 comments[].author.profilePageUrl
 comments[].author.resourceName
 comments[].content
 comments[].creationTime
 comments[].link
 comments[].link.imageUrl
 comments[].link.title
 comments[].link.url
 comments[].media
 comments[].media.contentType
 comments[].media.height
 comments[].media.resourceName
 comments[].media.url
 comments[].media.width
 comments[].postUrl
 comments[].resourceName
 comments[].updateTime
 communityAttachment
 communityAttachment.coverPhotoUrl
 communityAttachment.displayName
 communityAttachment.resourceName
 content
 creationTime
 link
 link.imageUrl
 link.title
 link.url
 location
 location.displayName
 location.latitude
 location.longitude
 location.physicalAddress
 media
 media.contentType
 media.description
 media.height
 media.resourceName
 media.url
 media.width
 plusOnes
 plusOnes[]
 plusOnes[].plusOner
 plusOnes[].plusOner.avatarImageUrl
 plusOnes[].plusOner.displayName
 plusOnes[].plusOner.profilePageUrl
 plusOnes[].plusOner.resourceName
 postAcl
 postAcl.communityAcl
 postAcl.communityAcl.community
 postAcl.communityAcl.community.displayName
 postAcl.communityAcl.community.resourceName
 postAcl.communityAcl.users
 postAcl.communityAcl.users[]
 postAcl.communityAcl.users[].displayName
 postAcl.communityAcl.users[].resourceName
 postAcl.eventAcl
 postAcl.eventAcl.event
 postAcl.eventAcl.event.resourceName
 postAcl.isLegacyAcl
 postAcl.visibleToStandardAcl
 postAcl.visibleToStandardAcl.circles
 postAcl.visibleToStandardAcl.circles[]
 postAcl.visibleToStandardAcl.circles[].displayName
 postAcl.visibleToStandardAcl.circles[].resourceName
 postAcl.visibleToStandardAcl.circles[].type
 postAcl.visibleToStandardAcl.users
 postAcl.visibleToStandardAcl.users[]
 postAcl.visibleToStandardAcl.users[].displayName
 postAcl.visibleToStandardAcl.users[].resourceName
 resharedPost
 resharedPost.album
 resharedPost.album.media
 resharedPost.album.media[]
 resharedPost.album.media[].contentType
 resharedPost.album.media[].description
 resharedPost.album.media[].height
 resharedPost.album.media[].resourceName
 resharedPost.album.media[].url
 resharedPost.album.media[].width
 resharedPost.author
 resharedPost.author.avatarImageUrl
 resharedPost.author.displayName
 resharedPost.author.profilePageUrl
 resharedPost.author.resourceName
 resharedPost.content
 resharedPost.link
 resharedPost.link.imageUrl
 resharedPost.link.title
 resharedPost.link.url
 resharedPost.media
 resharedPost.media.contentType
 resharedPost.media.description
 resharedPost.media.height
 resharedPost.media.resourceName
 resharedPost.media.url
 resharedPost.media.width
 resharedPost.resourceName
 resharedPost.url
 reshares
 reshares[]
 reshares[].resharer
 reshares[].resharer.avatarImageUrl
 reshares[].resharer.displayName
 reshares[].resharer.profilePageUrl
 reshares[].resharer.resourceName
 resourceName
 updateTime
 url

This list was generated with the following one-liner:

 jq -s 'map(.)' takeout_archive_2018/*.json | jq -r 'paths|map(.|tostring)|join(".")' | gsed -r 's/^[0-9]+(\.|$)//' | gsed -r 's/\.[0-9]+/[]/g' | sort -u > unique_keys.txt

(I'm using gsed to indicate that it's GNU sed, rather than, for instance, the default sed that comes with macOS; though the regular expressions are probably simple enough to work regardless.)

A . (period) indicates that the following key is a child, and [] (square brackets) indicate that the preceding parent key is an array. So, 'reshares' is an array whose members each contain the hash key 'resharer', which in turn contains the keys 'avatarImageUrl', 'displayName', 'profilePageUrl' and 'resourceName'.

In JSON this would be equivalent to:

 
 "reshares": [
   {
     "resharer": {
       "avatarImageUrl": "some url",
       "displayName": "some display name",
       "profilePageUrl": "some url",
       "resourceName": "some/resource/path"
     }
   }
 ]
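For readers without jq or GNU sed at hand, the same flat notation can be approximated in Python. This is a sketch of the idea rather than a port of the one-liner; the function name flat_key_paths and the sample values are invented, and empty arrays deliberately produce no '[]' entry, which mirrors how the jq paths function reports them.

```python
def flat_key_paths(value, prefix=""):
    """Yield dotted key paths, writing every array level as '[]'."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from flat_key_paths(child, path)
    elif isinstance(value, list):
        path = prefix + "[]"
        if value:
            # Only non-empty arrays contribute an element path.
            yield path
        for child in value:
            yield from flat_key_paths(child, path)


# Hypothetical post fragment, shaped like the example above.
sample_post = {
    "url": "some url",
    "reshares": [{"resharer": {"displayName": "some display name"}}],
}
unique_keys = sorted(set(flat_key_paths(sample_post)))
print("\n".join(unique_keys))
```

To cover a whole archive, each file would be decoded with json.load and the resulting paths merged into a single set before sorting.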
 


See Also