Data Migration Process and Considerations

From PlexodusWiki

Revision as of 12:01, 25 October 2018

Making use of Google+ Data Takeout can be more complicated than it first appears, particularly for a large archive with contributions from many people or organisations. You'll want to consider:

  • What to archive.
  • What you want to use from it.
  • How you plan to use the data.
  • What portions of the archive you want to, are able to, and have permission to make public.
  • Where you plan to publish it, and what tools exist to import the selections you publish.

Remember: Data are liability. Information which is useful to you may also be dangerous, to yourself or others, if made generally available.

A basic data migration plan

The steps and processes given here as of 15 October 2018 are preliminary, and more an outline than a procedure. We hope to improve and expand on them over time, particularly up to the mid-January 2019 window, by which we anticipate many final export decisions will be made, and the January to April 2019 window during which import and republishing will occur. Improvements are welcomed, particularly as Google clarifies capabilities, documentation, and processes, and as destination platforms provide specific tools or processes.

The general steps are:

  1. Identifying the information types you want to keep.
  2. Identifying the information types you can or should keep.
  3. Determining how you plan to use that information. Examples include posting to a blog, importing to another social media site, creating a personal archive, importing addresses and contacts, or creating a new forum or community site.
  4. Exporting the data from Google.
  5. Storing it until you can process it and have verified the final importing process.
  6. Unpacking, identifying, and selecting archive components.
  7. Converting extracted data to useful or usable formats.
  8. Cleaning up, converting, or updating the archives (may be done later). Includes updating or removing user/author references, URLs, and the like.
  9. Importing the data to the target or destination platform.
  10. Verifying import.
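The steps above can be sketched as a minimal pipeline skeleton. Everything here is illustrative scaffolding, not any actual Google or platform tooling; real implementations would replace the step bodies with working code.

```python
# Illustrative skeleton of the migration checklist above.
# All names are hypothetical placeholders.

STEPS = [
    "identify wanted data",
    "identify permitted data",
    "plan usage",
    "export from Google",
    "store safely",
    "unpack and select",
    "convert formats",
    "clean up references",
    "import to destination",
    "verify import",
]

def run_migration(steps=STEPS):
    """Walk the checklist in order, returning the completed steps."""
    completed = []
    for step in steps:
        # Real work would happen here; this sketch only records the order,
        # which matters: e.g. verification is pointless before import.
        completed.append(step)
    return completed

if __name__ == "__main__":
    for i, step in enumerate(run_migration(), 1):
        print(f"{i}. {step}")
```

The point of keeping the steps as ordered data is that a partial run (say, stopping after "store safely" until tools mature) remains easy to resume.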

The general data types within your Google Takeout archive will be:

  • Your Google+ posts.
  • Your Google+ comments.
  • Others' comments on your Google+ posts.
  • Your uploaded photos, videos, and other media.
  • Contact information.
  • Your profile description and metadata: generally your name, vitals, contact information, and "About" page descriptions and links.
  • Miscellaneous other data.

This list may be inaccurate or incomplete.
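One way to check the list against reality is to inspect what your own archive actually contains, by counting entries per top-level folder in the Takeout zip. A minimal sketch using Python's standard library; the archive filename is an assumption, and the folder names inside Takeout vary.

```python
import zipfile
from collections import Counter

def summarise_takeout(path):
    """Count zip entries grouped by their first two path components,
    e.g. 'Takeout/Google+ Stream'. Folder names vary between exports."""
    with zipfile.ZipFile(path) as zf:
        groups = Counter()
        for name in zf.namelist():
            parts = name.split("/")
            groups["/".join(parts[:2])] += 1
        return groups

# Usage (hypothetical filename):
# for folder, count in sorted(summarise_takeout("takeout-20181025.zip").items()):
#     print(count, folder)
```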

Google+ Data Migration

Again the process is:

  • Exporting Google+ data.
  • Storing Google+ data.
  • Extracting, classifying, and converting Google+ data.
  • Importing or publishing to target platforms.

Exporting Google+ data

This is a brief sketch of the process Dredmorbius used several years ago, on a Linux system. It should be fleshed out into a script or program. It is possible (though perhaps not likely) that Google will themselves provide tools or systems for managing archives. This request has been made, and others are encouraged to request it. Google-provided support should include tools to select and import data 'responsibly' to destination platforms. Responsible importing means respecting privacy scope.

You will have the option of specifying JSON or HTML formats. Google's JSON data is far more usable and useful than the HTML format, and is better supported by import tools.

The questions of want, can, may, and should refer to your preferences, abilities, permission, and risk exposure or resource limitations. Available export and import tools, copyright and other legal limitations or risks, privacy or appropriateness, and just general suitability, are among these considerations.

It may make sense to abandon some, much, or all of your data.

These are questions you and possibly your community must decide for yourselves.

The general process:

  1. Determine what data you want to, can, should, and may retain.
  2. Create your Google+ takeout. Select the JSON export format, NOT the HTML option.
  3. You probably want to include Posts, Comments, and Contacts, at a minimum, from Google+. You can include media such as photos, audio, and video if you like, or create a separate archive for them.
  4. Request the archive, and wait. Creation may take hours, possibly days. You will receive an email or notification when it is complete.
  5. If the archive fails or is incomplete, you will need to regenerate it. There are reports of many archival attempts failing. Google have been made aware of this; more feedback should help.

Storing Google+ data

  1. Store the archive in a safe place: somewhere it cannot be accidentally deleted or corrupted, and where others cannot gain unauthorised access to it.
  2. Your G+ takeout will almost certainly contain private data from or about you or others. Treat it like valuable and sensitive information. Again, there is a saying in security circles: data are liability.
  3. With some irony, Google Drive is probably one of the better options for storage: easy, accessible, durable, and reasonably safe. The information is already on Google, so you're not changing the risk calculus much.
  4. You may (and probably will want to) download the archive to your own computer (laptop or desktop). Be aware that storing it there carries a risk of damage, loss, or breach.
  5. One or more offline copies, saved to USB storage, a local NAS or archive system, or CD, DVD, or Blu-Ray media, is another Good Practice. I recommend optical media. Keep in mind that at roughly 700 MB, CDs have limited capacity relative to archives which may be 1 GB to 100 GB or larger; Blu-Ray may be your best option here. Burn 2-3 sets and store them securely in separate locations as protection against damage or loss.
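Whichever media you choose, recording a checksum when you make each copy lets you verify later that no copy has silently corrupted. A sketch using Python's standard library (the filename shown is hypothetical):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks
    so even multi-GB archives don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Usage: record the digest alongside each copy, then re-run and compare.
# print(sha256_of("takeout-20181025.zip"))
```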

That gets you your archive.

Extracting, classifying, and converting Google+ data

  • The jq (JSON query) utility can query, output, and process JSON archives. This is how you extract information from the archive.
  • Your post and comment data will appear in Google-markup format. This includes _italic_ and *bold* markup, as well as internal Google profile references. I don’t recall their format, but references may not be directly translatable.
  • A simple shell script (sed, awk, perl, python, ruby) can substitute HTML or Markdown tags within the content. I don’t think Pandoc directly recognises G+ markdown, but if it’s close enough to AsciiDoc, on which I think it’s based, that’s another option.
  • Pandoc can create HTML, or any of dozens of other formats, from Markdown (and possibly directly from the G+ tags). So that’s how you get HTML.
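As a sketch of the markup substitution described above, a few regular expressions can map the _italic_ and *bold* conventions to Markdown. This assumes only the simplest G+ conventions; real posts, with profile references, links, or nested markup, need more care.

```python
import re

def gplus_to_markdown(text):
    """Convert simple Google+ markup to Markdown:
    *word* -> **word** (bold), _word_ -> *word* (italic).
    Profile references and links are NOT handled here."""
    # Handle bold first, stashing it behind a placeholder so the
    # asterisks introduced for italics don't collide with it.
    text = re.sub(r"\*(\S[^*\n]*?)\*", "\x00\\1\x00", text)
    text = re.sub(r"_(\S[^_\n]*?)_", r"*\1*", text)
    return text.replace("\x00", "**")

# print(gplus_to_markdown("a *bold* and _italic_ word"))
```

From Markdown, Pandoc can then produce HTML or most other formats.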

You’ve still got the problems of:

  • Identifying post context, date, author, thread, and privacy scope. Those are contained in the JSON formats, but it’s been years since I’ve looked at them.
  • Determining which data you do and which you do NOT want to make public, because you could be violating original privacy scope and intent.

Those two issues mean that you should NOT simply blindly import and publish your Google+ archive on some new site or platform. You will want to review content. Skipping any non-public material as a first option is a Very Good Practice.
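As a sketch of that first-pass filter: if each post record carries some access or visibility field (the field name and values below are assumptions for illustration, not the documented Takeout schema), you can keep only clearly public items and set everything else aside for manual review.

```python
def split_by_visibility(posts, public_values=("Public",)):
    """Partition post records into (public, needs_review) lists.

    Assumes each post is a dict whose 'access' key names its share
    scope; the real Takeout JSON field may be named differently.
    """
    public, needs_review = [], []
    for post in posts:
        if post.get("access") in public_values:
            public.append(post)
        else:
            # Limited, circle-scoped, or unknown scope:
            # do not republish these without review.
            needs_review.append(post)
    return public, needs_review
```

Defaulting unknown scopes to the review pile errs on the side of the original author's privacy.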

That may still not be sufficient, due to copyright or other considerations of possible criminal or civil liability, or simply annoying whoever wrote, or is referenced by, the original content. You will have to use judgement here. Again, this means that simply redirecting the entire archive is not a viable process.

Import or publish to target platforms

Finally, import the data or publish it to your intended platforms.

Contact information should be in vCard format, which is supported by most email and contact-management systems. The contacts may _not_ be particularly useful if they don't include email, phone, or other non-Google+ addresses.
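For reference, a minimal vCard 3.0 entry is just a few lines of text. This sketch builds one from a name and optional email and phone; it folds the full name into the structured N field as a simplification, and omits all other fields.

```python
def make_vcard(full_name, email=None, phone=None):
    """Build a minimal vCard 3.0 string for one contact.

    Simplification: the whole name goes into the family-name slot
    of N rather than being split into its components.
    """
    lines = [
        "BEGIN:VCARD",
        "VERSION:3.0",
        f"FN:{full_name}",
        f"N:{full_name};;;;",
    ]
    if email:
        lines.append(f"EMAIL;TYPE=INTERNET:{email}")
    if phone:
        lines.append(f"TEL;TYPE=CELL:{phone}")
    lines.append("END:VCARD")
    # vCard lines are CRLF-terminated.
    return "\r\n".join(lines) + "\r\n"

# print(make_vcard("Ada Lovelace", email="ada@example.org"))
```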

Publishing to your Exodus destination platform(s) will vary by platform and available tools. It's been suggested to Google that they work with major providers to facilitate this, including respecting privacy settings where appropriate. You are encouraged to provide similar feedback.

Further considerations

Data migration will take considerable time if you plan on making it public. Those who've ... had the pleasure of going through related processes a few times know that it can take months for one individual at the level of a few hundred items. You may find it simpler to just abandon a large archive of > 1,000 articles or so.

It is not necessary to convert all content in advance; conversion can be completed post-migration.

Assessing the scope of this task should be part of your pre-migration planning phase.