Data Migration Process and Considerations


Making use of your Google+ Data Takeout can be more complicated than it first appears, particularly for a large archive with contributions from many people or organisations. You'll want to consider:

  • What to archive.
  • What you want to use from it.
  • How you plan to use the data.
  • What portions of the archive you want to, are able to, and have permission to make public.
  • Where you plan to publish it, and what tools exist to import the selections you publish.

Remember: Data are liability. Information which is useful to you may also be dangerous, to yourself or others, if made generally available.


Cautions and disclaimers specific to Data Migration information

Our intent is to provide useful and helpful information. This is a wiki: it is generally editable, though it is also patrolled by editors and administrators. We cannot guarantee that the information at any given time is either correct or non-malicious, though we make reasonable efforts to ensure that it is both.

You should ensure that Web pages represented as Google properties are in fact Google properties when you navigate to them.

Do not enter your Google username and password into any non-Google site or domain unless that is specifically what you intend.

(As an example: when you are deliberately using a third-party tool to manage your Google site.)

In general, Google provides single-use or single-application passwords for such access. Make use of these if at all possible.

We are intentionally keeping the use of URLs on this page to an absolute minimum, and are presenting naked rather than formatted URLs (e.g., https://google.com/ rather than a link labelled Google) in most instances. Verify any URLs referenced here.

(This is vaguely paranoid on our part, but we're aware of potential for malice, vandalism, and abuse, and wish to minimise risks.)

Though we're referring to PlexodusWiki here, this is good general guidance and practice.

Now on to the migration...


A basic data migration plan

The steps and processes given here as of 15 October 2018 are preliminary, and more an outline than a procedure. We hope to improve and expand on them over time, particularly up to the mid-January 2019 window in which we anticipate many final export decisions, and the January–April 2019 window during which import and republishing will occur. Improvements are welcomed, particularly as Google clarifies capabilities, documentation, and processes, and as destination platforms provide specific tools or processes.


The general steps are:

  1. Identifying the information types you want to keep.
  2. Identifying the information types you can or should keep.
  3. Determining how you plan to use that information. Examples include posting to a blog, importing to another social media site, creating a personal archive, importing addresses and contacts, or creating a new forum or community site.
  4. Exporting the data from Google.
  5. Storing it until you can process it and have verified the final importing process.
  6. Unpacking, identifying, and selecting archive components.
  7. Converting extracted data to useful or usable formats.
  8. Cleaning up, converting, or updating the archives (may be done later). Includes updating or removing user/author references, URLs, and the like.
  9. Importing the data to the target or destination platform.
  10. Verifying import.

The general data types within your Google Takeout archive will be:

  • Your Google+ posts.
  • Your Google+ comments.
  • Others' comments on your Google+ posts.
  • Your uploaded photos, videos, and other media.
  • Contact information.
  • Your profile description and metadata: generally your name, vitals, contact information, and "About" page descriptions and links.
  • Miscellaneous other data.

This list may be inaccurate or incomplete.


Google+ Data Migration

Again, the process is:

  • Exporting Google+ data
  • Storing Google+ data
  • Extracting, classifying, and converting Google+ data
  • Importing or publishing to target platforms


Google provides for data export via its Google Data Takeout page, also referred to as Download your Data. This is part of the Data Liberation Front project within Google; all of these terms are used at various points.


The Google Data Takeout URL is: https://takeout.google.com/settings/takeout
PLEASE NOTE THAT YOU SHOULD VERIFY THAT THIS IS A GOOGLE DOMAIN AND YOU SHOULD VERIFY THIS LOCATION INDEPENDENTLY.

Google also provide help on Google Data Takeout. We feel that some of the guidance is not as useful as it could be, but you should consult it here:

Google Account Help: Download your data

We recommend reading through the rest of this page before creating your data archive, as there are considerations presented here that affect the choices you make. We will provide further guidance in future on the choices we feel are preferable.

Please note that we can make no guarantees about the information provided here, and that all liability is disclaimed. Information is provided in good faith, though this page is open to general editing.


Exporting Google+ data

There may (and almost certainly will) be tools for utilising your Google Data Takeout automatically, potentially online through Google tools and/or services, such as Google Drive.

As an alternative, it is possible to work with the archive directly using commonly available tools on a Linux, macOS, or Windows desktop or laptop computer. This should be considered an advanced and technical process. If you are not comfortable using bash or similar command shells, and scripting languages such as awk, Perl, Python, Ruby, etc., you are strongly encouraged to skip this section.
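
As a concrete starting point, here is a minimal sketch of unpacking a downloaded Takeout zip and counting the Google+ post files it contains. The archive filename and the Takeout/Google+ Stream/Posts path are assumptions based on typical Takeout layouts; verify both against your own archive:

 # Unpack the Takeout zip into a working directory (filename is illustrative).
 unzip takeout-20181015T000000Z-001.zip -d takeout
 # Count the per-post JSON files described later on this page.
 find "takeout/Takeout/Google+ Stream/Posts" -name '*.json' | wc -l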

This is a brief sketch of the process Dredmorbius used several years ago, on a Linux system. It should be fleshed out into a script or program. It is possible (though perhaps not likely) that Google will themselves provide tools or systems for managing archives. This request has been made, and others are encouraged to make it as well. Google-provided support should include tools to select and import data responsibly to destination platforms, where responsible importing means respecting privacy scope.

You will have the option of specifying JSON or HTML formats for data export. Google's JSON data is far more usable and useful than the HTML format, and is better supported by import tools.

The questions of want, can, may, and should refer, respectively, to your preferences, your abilities, your permissions, and your risk exposure or resource limitations. Available export and import tools, copyright and other legal limitations or risks, privacy or appropriateness, and general suitability are among these considerations.

It may make sense to abandon some, much, or all of your data.

These are questions you and possibly your community must decide for yourselves.

The general process:

  1. Determine what data you want to, can, should, and may retain.
  2. Create your Google+ takeout. Select the JSON export format, NOT the HTML option.
  3. You probably want to include Posts, Comments, and Contacts from Google+, at a minimum. You can include media such as photos, audio, and video if you like, or create a separate archive for them.
  4. Request the archive, and wait. Creation may take hours, possibly days. You will receive an email or notification when it is complete.
  5. If the archive fails or is incomplete, you will need to regenerate it. There are reports of many archive attempts failing; a quick integrity check is sketched after this list. Google have been made aware of this, and more feedback should help.
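
A quick integrity check on a downloaded archive before relying on it (the filename is illustrative; Takeout can also produce tgz archives, for which tar -tzf serves the same purpose):

 # Test every member of the zip; a truncated or corrupt download fails here.
 unzip -t takeout-20181015T000000Z-001.zip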

Storing Google+ data

Your archive represents both valuable and potentially harmful information. Loss, modification, or disclosure could all pose dangers.

  1. Store the archive in a safe place. This means somewhere it cannot be accidentally deleted or corrupted, and where others cannot gain unauthorised access to it.
  2. Your G+ takeout will almost certainly contain private data from or about you or others. Treat it like valuable and sensitive information. Again, there is a saying in security circles: Data are liability.
  3. With some irony, Google Drive is probably one of the better options for storage: easy, accessible, durable, and reasonably safe. The information is already on Google, so you're not changing the risk calculus much.
  4. You may (and probably want to) download the archive to your own computer (laptop or desktop). Be aware that storing it there may pose a risk of damage, loss, or breach.
  5. One or more offline copies, saved to USB storage, a local NAS or archive system, or CD, DVD, or Blu-ray media, is another Good Practice (see the checksum sketch after this list). I recommend optical media. Keep in mind that at roughly 700 MB, CDs have limited capacity relative to archives which may be 1–100 GB or larger; Blu-ray may be your best option here. Burn 2-3 sets and store them securely in separate locations as protection against damage or loss.
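
The checksum sketch mentioned above (our suggestion, not part of Google's guidance): record checksums when you make your copies, so any silent corruption of a copy can be detected later. On macOS, shasum -a 256 substitutes for sha256sum:

 # Record checksums of the archive files (names are illustrative).
 sha256sum takeout-*.zip > takeout.sha256
 # Later, on any copy, verify against the recorded values.
 sha256sum -c takeout.sha256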

That’s got you your archive.

Extracting, classifying, and converting Google+ data

  • The jq (JSON query) utility can query, output, and process JSON archives. This is how you extract information from the archive.
  • Your post and comment data will appear in Google-markup format. This includes _italic_ and *bold* markup, as well as internal Google profile references. I don't recall their format, but references may not be directly translatable. (Note that the takeout analysed later on this page appears to deliver content as HTML instead; check which form your archive uses.)
  • A simple shell script (sed, awk, perl, python, ruby) can substitute HTML or Markdown tags within the content; a sketch follows this list. I don't think Pandoc directly recognises G+ markup, but if it's close enough to AsciiDoc, on which I think it's based, that's another option.
  • Pandoc can create HTML, or any of dozens of other formats, from Markdown (and possibly directly from the G+ tags). So that's how you get HTML.
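
As a sketch of the extraction and substitution steps above, assuming per-post JSON files (structure documented below) and content carrying G+ _italic_ / *bold* markup; the filenames are hypothetical, and if your takeout delivers HTML-formatted content the sed pass is unnecessary:

 # Pull the post body out of a single per-post JSON file.
 jq -r '.content' '20181015 - Some post title.json' > post.txt
 # Crude G+-markup-to-HTML pass over the extracted body.
 sed -E -e 's|\*([^*]+)\*|<strong>\1</strong>|g' \
        -e 's|_([^_]+)_|<em>\1</em>|g' post.txt > post.html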

You’ve still got the problems of:

  • Identifying post context, date, author, thread, and privacy scope. Those are contained in the JSON formats, but it’s been years since I’ve looked at them.
  • Determining which data you do and which you do NOT want to make public, because you could otherwise be violating the original privacy scope and intent.

Those two issues mean that you should NOT simply blindly import and publish your Google+ archive on some new site or platform. You will want to review content. Skipping any non-public material as a first option is a Very Good Practice.
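
As one hedged example of honouring that rule, the postAcl structure documented below can be used to select only posts explicitly marked public and skip everything else (the path is illustrative):

 # List only posts whose ACL includes the public circle; all others are skipped.
 for f in "Takeout/Google+ Stream/Posts"/*.json; do
   jq -e '.postAcl.visibleToStandardAcl.circles[]?
          | select(.type == "CIRCLE_TYPE_PUBLIC")' "$f" > /dev/null \
     && echo "$f"
 done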

That may still not be sufficient, due to copyright or other considerations of possible criminal or civil liability, or simply the risk of annoying whoever wrote or is referenced by the original content. You will have to use judgement here. Again, this means that simply redirecting the entire archive is not a viable process.


Takeout Data Structure

Google+ Stream

Posts
Filename

For each Activity resource (the technical term the API reference uses for posts), a separate JSON file is created. Its filename appears to follow this structure:

 YYYYMMDD - UNIQUE_POST_TITLE.json

or, represented as a Regular Expression:

 (?<year>[0-9]{4})(?<month>[0-9]{2})(?<day>[0-9]{2}) - (?<unique_post_title>.{0,42})\.json

where YYYYMMDD is the Activity's creation date consisting of a 4 digit year, 2 digit month and 2 digit day of month, and UNIQUE_POST_TITLE is a unique (within the scope of the same day) Activity identifier that's at most 42 characters long.

The UNIQUE_POST_TITLE generally is the first 39 characters of the Activity's content. However, if this would cause a duplicate filename within the same day, the last characters of this fragment will be replaced with a (\d+) suffix (e.g. (1)), where the number between the parentheses is an integer which starts at 1 and gets incremented by 1 until the suffixed title fragment is again unique within the scope of the day.
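
For instance, two same-day posts with identical opening content might be stored as (hypothetical titles):

 20181015 - On the care and feeding of archives.json
 20181015 - On the care and feeding of archi(1).json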

Data

The data is a standard JSON file. Its structure can be visualised as a Hash or associative array as follows:

 
  {
    'url':          'https://plus.google.com/ACTIVITY_USER_ID/posts/ACTIVITY_ID',
    'creationTime': 'YYYY-mm-dd HH:MM:SS:zzzz',
    'updateTime':   'YYYY-mm-dd HH:MM:SS:zzzz',
    'author':
    {
      'displayName':    'DISPLAY_NAME_WITHOUT_NICKNAME',
      'profilePageUrl': 'https://plus.google.com/ACTIVITY_USER_ID',
      'avatarImageUrl':  'https://lh3.googleusercontent.com/-RANDOM/PATH/TO/RESOURCE/photo.jpg',
      'resourceName':    'users/ACTIVITY_USER_ID'
    },
    'album':
    {
      'media':
      [
        {
          'url':          'https://lh3.googleusercontent.com/-RANDOM/PATH/TO/RESOURCE/photo.jpg',
          'contentType':   'image/*',
          'width':        1234,
          'height':       4321,
          'resourceName': 'media/MEDIA_RESOURCE_ID',
        }
      ]
    },
    'content':          'HTML_FORMATTED_CONTENT',
    'link':
    {
      'title':   'LINK_TITLE',
      'url':     'LINK_ABSOLUTE_URL',
      'imageUrl': 'LINK_ABSOLUTE_IMAGE_URL'
    },
    'comments':
    [
      {
        'creationTime': 'YYYY-mm-dd HH:MM:SS:zzzz',
        'author':
        {
          'displayName':    'DISPLAY_NAME_WITHOUT_NICKNAME',
          'profilePageUrl': 'https://plus.google.com/USER_ID',
          'avatarImageUrl': 'https://lh3.googleusercontent.com/-RANDOM/PATH/TO/RESOURCE/photo.jpg',
          'resourceName':   'users/USER_ID'
        },
        'content':      'HTML_FORMATTED_CONTENT',
        'postUrl':      'https://plus.google.com/ACTIVITY_USER_ID/posts/ACTIVITY_ID',
        'resourceName': 'users/ACTIVITY_USER_ID/posts/THREAD_ID/comments/COMMENT_ID'
      }
    ],
    'resourceName':    'users/ACTIVITY_USER_ID/posts/THREAD_ID',
    'plusOnes':
    [
      {
        'plusOner':
        {
          'displayName':    'DISPLAY_NAME_WITHOUT_NICKNAME',
          'profilePageUrl': 'https://plus.google.com/USER_ID',
          'avatarImageUrl': 'https://lh3.googleusercontent.com/-RANDOM/PATH/TO/RESOURCE/photo.jpg',
          'resourceName':   'users/USER_ID'
        }
      }
    ],
    'postAcl':
    {
      'visibleToStandardAcl':
      {
        'circles':
        [
          {
            'type': 'CIRCLE_TYPE_PUBLIC'
          },
          {
            'type': 'CIRCLE_TYPE_YOUR_CIRCLES'
          },
          {
            'type': 'CIRCLE_TYPE_EXTENDED_CIRCLES'
          },
          {
            'resourceName': 'circles/USER_ID-CIRCLE_ID',
            'type': 'CIRCLE_TYPE_USER_CIRCLE',
            'displayName': 'CIRCLE_DISPLAY_NAME'
          }
        ],
        'users':
        [
          {
            'resourceName': 'users/USER_ID',
            'displayName':  'DISPLAY_NAME_WITHOUT_NICKNAME'
          }
        ]
      },
      'communityAcl':
      {
        'community':
        {
          'resourceName': 'communities/COMMUNITY_ID',
          'displayName':  'COMMUNITY_DISPLAY_NAME'
        },
        'users':
        [
          {
            'resourceName': 'users/USER_ID',
            'displayName':  'DISPLAY_NAME_WITHOUT_NICKNAME'
          }
        ]
      },
      'eventAcl':
      {
        'event':
        {
          'resourceName': 'events/EVENT_ID'
        }
      },
      'collectionAcl':
      {
        'collection':
        {
          'resourceName': 'collections/COLLECTION_ID',
          'displayName':  'COLLECTION_DISPLAY_NAME'
        }
      }
    }
  }
  

Note that this structure may still be incomplete, as it's currently only based on analysis of a few of the JSON files, from just a single user's takeout.
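
To check your own takeout against this structure, pretty-printing a single post file with jq is enough (the filename here is hypothetical):

 jq . 'Takeout/Google+ Stream/Posts/20181015 - Some post title.json'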

Legend
  • ACTIVITY_USER_ID is the numeric Person ID of the user who posts the Activity;
  • ACTIVITY_ID a unique identifier consisting of alpha-numeric characters ([a-zA-Z0-9]);
  • USER_ID is the numeric Person ID of the user within the scope of the current content. For instance, within the scope of 'comments', it is the Person ID of the user who posted the comment;
  • DISPLAY_NAME_WITHOUT_NICKNAME is the display name of the profile; the nickname does not seem to be included, even if the user had set their nickname to be included within their display name.
  • HTML_FORMATTED_CONTENT is the content (post's or comment's body) formatted using HTML, rather than containing Google's own markdown-like markup/formatting language.
  • LINK_TITLE is the title of the linked (external) webpage, likely extracted from the webpage's <title> HTML tag.
  • LINK_ABSOLUTE_URL is the absolute URL of the linked (external) webpage (e.g. https://somesite.example/path/to/some/page).
  • LINK_ABSOLUTE_IMAGE_URL is the absolute URL of the linked (external) image resource (e.g. https://somesite.example/path/to/some/image.jpg).
  • THREAD_ID appears to be a unique identifier for a comment thread for an activity. It also seems to be a secondary Activity ID (perhaps used internally?), as it is used not only within comment threads but also in the Activity's resourceName.
  • COMMENT_ID appears to be a unique identifier for a comment within a comment thread for an activity. It appears to consist of two sub-fragments, delimited by a hyphen (-). The first fragment appears to be the same for all comments within the same thread, and the last (shortest) fragment seems to be unique within that thread.
  • COMMUNITY_ID unique numeric identifier for the community in which the Activity was posted.
  • COMMUNITY_DISPLAY_NAME string, display name for the community in which the Activity was posted.
  • EVENT_ID unique alpha-numeric identifier for the Event Activity that was posted.
  • MEDIA_RESOURCE_ID an alpha-numeric URL-encoded identifier, or possibly a URL-encoded BASE64 content string.
  • COLLECTION_ID an alpha-numeric identifier for the Collection the Activity is in.
  • COLLECTION_DISPLAY_NAME the display name of the Collection.
  • media.width and media.height are *integer* values representing their dimensions in pixels.

Within datetime strings:

  • YYYY is a 4 digit year,
  • mm a 2 digit month (01-12),
  • dd a 2 digit day of the month (01-31),
  • HH a 2 digit hour (00-23),
  • MM a 2 digit minute of the hour (00-59),
  • SS 2 digits indicating the seconds of the minute (00-59) and
  • zzzz 2x2 digits indicating the hours and minutes offset from UTC.

For the postAcl (Access Control List) data, there is some exclusivity regarding the items, but for the sake of completeness all *possible* items are listed.

It's likely that you can have either a visibleToStandardAcl item, an eventAcl item, a communityAcl item, or a collectionAcl item, but not more than one of them at once.

Furthermore, within visibleToStandardAcl.circles, the CIRCLE_TYPE_PUBLIC, CIRCLE_TYPE_YOUR_CIRCLES, and CIRCLE_TYPE_EXTENDED_CIRCLES circle types are likely mutually exclusive: you won't find more than one of them together. You can likely find one of them combined with zero, one, or more CIRCLE_TYPE_USER_CIRCLE-type circle items, though.

Finally, within communityAcl the users item is optional.
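
One way to test these exclusivity assumptions against your own archive is to tally which postAcl variants actually occur across your posts (the path is illustrative):

 # Count occurrences of each postAcl key across all post files.
 jq -r '.postAcl // {} | keys[]' 'Takeout/Google+ Stream/Posts'/*.json | sort | uniq -c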

Complete Flat JSON Structure

What follows is a list of all possible JSON keys encountered after analysing a complete archive of 2145 Google+ Stream posts from 2011-06-30 to 2018-10-23:

 album
 album.media
 album.media[]
 album.media[].contentType
 album.media[].description
 album.media[].height
 album.media[].resourceName
 album.media[].url
 album.media[].width
 author
 author.avatarImageUrl
 author.displayName
 author.profilePageUrl
 author.resourceName
 comments
 comments[]
 comments[].author
 comments[].author.avatarImageUrl
 comments[].author.displayName
 comments[].author.profilePageUrl
 comments[].author.resourceName
 comments[].content
 comments[].creationTime
 comments[].link
 comments[].link.imageUrl
 comments[].link.title
 comments[].link.url
 comments[].media
 comments[].media.contentType
 comments[].media.height
 comments[].media.resourceName
 comments[].media.url
 comments[].media.width
 comments[].postUrl
 comments[].resourceName
 comments[].updateTime
 communityAttachment
 communityAttachment.coverPhotoUrl
 communityAttachment.displayName
 communityAttachment.resourceName
 content
 creationTime
 link
 link.imageUrl
 link.title
 link.url
 location
 location.displayName
 location.latitude
 location.longitude
 location.physicalAddress
 media
 media.contentType
 media.description
 media.height
 media.resourceName
 media.url
 media.width
 plusOnes
 plusOnes[]
 plusOnes[].plusOner
 plusOnes[].plusOner.avatarImageUrl
 plusOnes[].plusOner.displayName
 plusOnes[].plusOner.profilePageUrl
 plusOnes[].plusOner.resourceName
 postAcl
 postAcl.communityAcl
 postAcl.communityAcl.community
 postAcl.communityAcl.community.displayName
 postAcl.communityAcl.community.resourceName
 postAcl.communityAcl.users
 postAcl.communityAcl.users[]
 postAcl.communityAcl.users[].displayName
 postAcl.communityAcl.users[].resourceName
 postAcl.eventAcl
 postAcl.eventAcl.event
 postAcl.eventAcl.event.resourceName
 postAcl.isLegacyAcl
 postAcl.visibleToStandardAcl
 postAcl.visibleToStandardAcl.circles
 postAcl.visibleToStandardAcl.circles[]
 postAcl.visibleToStandardAcl.circles[].displayName
 postAcl.visibleToStandardAcl.circles[].resourceName
 postAcl.visibleToStandardAcl.circles[].type
 postAcl.visibleToStandardAcl.users
 postAcl.visibleToStandardAcl.users[]
 postAcl.visibleToStandardAcl.users[].displayName
 postAcl.visibleToStandardAcl.users[].resourceName
 resharedPost
 resharedPost.album
 resharedPost.album.media
 resharedPost.album.media[]
 resharedPost.album.media[].contentType
 resharedPost.album.media[].description
 resharedPost.album.media[].height
 resharedPost.album.media[].resourceName
 resharedPost.album.media[].url
 resharedPost.album.media[].width
 resharedPost.author
 resharedPost.author.avatarImageUrl
 resharedPost.author.displayName
 resharedPost.author.profilePageUrl
 resharedPost.author.resourceName
 resharedPost.content
 resharedPost.link
 resharedPost.link.imageUrl
 resharedPost.link.title
 resharedPost.link.url
 resharedPost.media
 resharedPost.media.contentType
 resharedPost.media.description
 resharedPost.media.height
 resharedPost.media.resourceName
 resharedPost.media.url
 resharedPost.media.width
 resharedPost.resourceName
 resharedPost.url
 reshares
 reshares[]
 reshares[].resharer
 reshares[].resharer.avatarImageUrl
 reshares[].resharer.displayName
 reshares[].resharer.profilePageUrl
 reshares[].resharer.resourceName
 resourceName
 updateTime
 url

This list was generated with the following one-liner:

jq -s 'map(.)' takeout_archive_2018/*.json | jq -r 'paths|map(.|tostring)|join(".")' |gsed -r 's/^[0-9]+(\.|$)//'|gsed -r 's/\.[0-9]+/[]/g'|sort -u > unique_keys.txt

(I'm using gsed to indicate GNU sed, rather than, for instance, the default sed that comes with macOS; though the regular expressions are probably simple enough to work regardless.)

A . (period) indicates that the following key is a child, and [] (square brackets) indicate that the preceding parent key is an array. So, 'reshares' is an array whose members contain the hash key 'resharer', which in turn contains the keys 'avatarImageUrl', 'displayName', 'profilePageUrl' and 'resourceName'.

In JSON this would be equivalent to:

 
 'reshares': [
   {
     'resharer': {
       'avatarImageUrl': 'some url',
       'displayName': 'some display name',
       'profilePageUrl': 'some url',
       'resourceName': 'some/resource/path'
     }
   }
 ]
 

Importing or publishing to target platforms

Finally, import the data or publish it to your intended platforms.

Contact information should be in vCard format, which is supported by most email and contact-management systems. The contacts may not be particularly useful if they don't include email, phone, or other non-Google+ addresses.
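
For reference, a minimal vCard entry looks like this (illustrative values); exported contacts that carry little more than a name and a Google profile URL will be of limited use:

 BEGIN:VCARD
 VERSION:3.0
 N:Contact;Example;;;
 FN:Example Contact
 EMAIL:contact@example.com
 END:VCARD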

Publishing to your Exodus destination platform(s) will vary by platform and available tools. It's been suggested to Google that they work with major providers to facilitate this, including respecting privacy settings where appropriate. You are encouraged to provide similar feedback.

Further considerations

Data migration will take considerable time if you plan on making the data public. Those who've ... had the pleasure of going through related processes a few times know that it can take months for one individual at the level of a few hundred items. You may find it simpler to just abandon a large archive of more than 1,000 articles or so.

It is not necessary to convert all content in advance, however; the process can be completed post-migration.

Assessing the scope of this task should be part of your pre-migration planning phase.

