Data Migration Process and Considerations

From PlexodusWiki
Revision as of 23:04, 18 December 2018 by Dredmorbius (Talk | contribs)


The goal of data migration is to make use of, or provide access to, data elsewhere after Google+ is shut down. The methods and procedures described here serve that goal.

Making use of Google+ Data Takeout can be more complicated than it first appears, particularly for a large archive with contributions from many people or organisations. You'll want to consider:

  • What to archive.
  • What you want to use from it.
  • How you plan to use the data.
  • What portions of the archive you want to, are able to, and have permission to make public.
  • Where you plan to publish it, and what tools exist to import the selections you publish.

Remember: Data are liability. Information which is useful to you may also be dangerous, to yourself or others, if made generally available.

This page is under development as we explore and confirm procedures for Google Data Takeout. Consider information preliminary, as of 3 December 2018.


Cautions and disclaimers specific to Data Migration information

Our intent is to provide useful and helpful information. This is a Wiki: it is generally editable, though it is also patrolled by editors and administrators. We cannot guarantee that the information at any time is either correct or non-malicious, though we will make reasonable attempts to ensure that it is both.

You should ensure that Web pages represented as Google properties are in fact Google properties when you navigate to them.

Do not enter your Google username and password into any non-Google site or domain unless that is specifically what you intend.

(As an example: if you are using a third-party tool to manage your Google site.)

In general, Google will provide single-use, or single-application passwords for such access. Make use of such tools if at all possible.

We are intentionally keeping use of URLs on this page to an absolute minimum, and are presenting naked rather than formatted URLs (e.g., https://google.com/ rather than Google) in most instances. Verify URLs referenced here.

(This is vaguely paranoid on our part, but we're aware of potential for malice, vandalism, and abuse, and wish to minimise risks.)

Though we're referencing PlexodusWiki, this is good general guidance and practice.

Now on to the migration....


Data migration goals

The actions we undertake are usually driven by a goal. That is the case for data migration.

The reason to migrate data is to preserve it or provide access to it elsewhere. In this case, that means after Google+ is no longer available.

To determine your goals, ask yourself questions, such as:

  • Do I, or others, want, need, require, or have rights to, or obligations to provide, data now on Google+ which I can migrate? If no, then you can ignore the rest of this page. If yes, then consider further...
  • Are there conflicting interests, or risks, that argue against preserving the data?
  • Is the planned use private or public? If private, for one person or shared amongst people? How will you provide shared access?
  • How much data do you need to migrate? If it's a small amount, storage and transfer are not big concerns. If it's a large amount, you'll need to think about where you store it and how you get it there. If you don't have storage, or you have poor Internet service, you may want to improve these or find alternatives that eliminate these limitations.
  • What formats are you interested in? Text, images, audio, video, other? Do these need to be kept associated or together, organised, selected, edited, modified, or adapted?
  • Do you have rights to the data? Is this your own work, public domain, or others' information, including comments or posts by others? Publishing, or even copying or transferring information may have legal or other considerations. Data are liability. Rules and laws have changed significantly in recent years.
  • What systems do you plan on using the data on? A local computer? A blog? An existing online service? What form does it need to be in, how does it need to be organised, what formatting or other considerations are needed?
  • If this is a Community or Group resource: what is the understanding among group members over what information will be retained or deleted? Do individuals have a say, and if so, how is this expressed and enacted, whether for preservation or deletion?
  • Is any of the material illegal or does it pose a legal risk, in your local jurisdiction or elsewhere? Nudity, pornography, weapons-related material, religious materials, certain actions or solicitations, political and historical materials, and more, may be restricted or illegal. How will you address this? Does use pose a risk to you? Others?


Typical Data Migration User Stories

A "user story" is a scenario describing how someone might make use of some system or facility, and is often used in product planning and development. We're using them to explore data migration possibilities. If you're still trying to determine how you or others might need or want to migrate data, this may be useful. If you know what you intend to do, you may skip this section.

Some simple examples, describing goals:

  • Preserving your Google+ posts and comments for personal use on your own computer, as a reference. This might include material from others.
  • Republishing your Google+ content on a blog. This might involve reformatting or re-editing content, matching images or other references initially supplied, or including comments and content by others.
  • Creating a local archive of Google+ Photos.
  • Republishing photos to a blog or media site.
  • Migrating a G+ Community to a new platform. This might include community members, metadata, and old posts and comments.
  • Moving a business or product support site to a new online location.

There are many other possibilities; additions to this list are welcomed.



A basic data migration plan

The steps and processes given here as of 15 October 2018 are preliminary and more an outline than a procedure. We hope to improve and expand on them over time, particularly up to the mid-January 2019 window in which we anticipate many final export decisions, and the January - April 2019 window during which import and republishing will occur. Improvements are welcomed, particularly as Google clarifies capabilities, documentation, and processes, and destination platforms provide specific tools or processes.


The general steps are:

  1. Identifying the information types you want to keep.
  2. Identifying the information types you can or should keep.
  3. Determining how you plan to use that information. Examples include posting to a blog, importing to another social media site, creating a personal archive, importing addresses and contacts, or creating a new forum or community site.
  4. Exporting the data from Google.
  5. Storing it until you can process it and have verified the final importing process.
  6. Unpacking, identifying, and selecting archive components.
  7. Converting extracted data to useful or usable formats.
  8. Cleaning up, converting, or updating the archives (may be done later). Includes updating or removing user/author references, URLs, and the like.
  9. Importing the data to the target or destination platform.
  10. Verifying import.

The general data types within your Google Takeout archive will be:

  • Your Google+ posts
  • Your Google+ comments
  • Others' comments on your Google+ posts.
  • Your uploaded photos, videos, and other media.
  • Contact information.
  • Your profile description and metadata. Generally: your name, vitals, contact information, and "About" page descriptions and links.
  • Miscellaneous other data.

This list may be inaccurate or incomplete.


Google+ Data Migration

Again the process is:

  • Export Google data
  • Transfer to storage
  • Store Google+ data
  • Select, filter, classify, and convert Google+ data.
  • Import or publish to target platforms


Google provides for data export via its Google Data Takeout page, also referred to as Download your Data. This is part of the Data Liberation Front project within Google. All of these terms are used at various points.


The Google Data Takeout URL is: https://takeout.google.com/settings/takeout
PLEASE NOTE THAT YOU SHOULD VERIFY THAT THIS IS A GOOGLE DOMAIN AND YOU SHOULD VERIFY THIS LOCATION INDEPENDENTLY.

It is also possible to specify specific products to be archived on the URL, for example, Google+ Pages, Circles, Stream, Plus Ones, and Profile, as here:

https://takeout.google.com/settings/takeout/custom/plus_pages,circles,stream,plus_one,profile

This should provide a comprehensive and sufficient Google+ data archive. We are still tuning the selection, and it's likely that plus_pages and plus_one are not needed, though they add little bulk to most archives.
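
As an illustration of the two points above (composing a product-specific Takeout URL, and checking that a URL really is on a Google domain), here is a minimal Python sketch. The function names takeout_url and is_google_host are our own inventions for this example. The product identifiers are those shown in the URL above and remain unconfirmed. A string check like this is no substitute for verifying the address independently in your browser.

```python
from urllib.parse import urlsplit

# Base address as given on this page; verify it independently.
TAKEOUT_BASE = "https://takeout.google.com/settings/takeout"


def takeout_url(products):
    """Build a Takeout URL limited to the given product identifiers."""
    if not products:
        return TAKEOUT_BASE
    return TAKEOUT_BASE + "/custom/" + ",".join(products)


def is_google_host(url):
    """True only for https URLs whose host is google.com or a subdomain of it."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    on_google = host == "google.com" or host.endswith(".google.com")
    return parts.scheme == "https" and on_google


url = takeout_url(["plus_pages", "circles", "stream", "plus_one", "profile"])
print(url)
print(is_google_host(url))
```

Note that the check compares the end of the host name, so a look-alike such as takeout.google.com.example.org is rejected.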

TODO: Confirm elements selected.


Google also provide help on Google Data Takeout. We feel that some of the guidance is not as useful as it could be, but you should consult it here:

Google Account Help: Download your data

We recommend reading through the rest of this page before creating your data archive, as there are considerations presented here. We will be providing further guidance in future on the choices we feel are preferable.

Please note that we can make no guarantee of information provided here, and that all liability is disclaimed. Information is provided in good faith, though this page is open to general editing.



Exporting Google+ data

There may (and almost certainly will) be tools for utilising your Google Data Takeout automatically, potentially online through Google tools and/or services, such as Google Drive.

As an alternative it is possible to work with the archive directly using commonly available tools on a Linux, MacOS, or Windows desktop or laptop computer. This should be considered an advanced and technical process. If you are not comfortable using the bash or similar command shells, and scripting languages such as awk, Perl, Python, Ruby, etc., you are strongly encouraged to skip this section.

This is a brief sketch of the process Dredmorbius used several years ago, on a Linux system. It should be fleshed out into a script or program. It is possible (though perhaps not likely) that Google will themselves provide tools or systems for managing archives. This request has been made and others are encouraged to request it. Google-provided support should include tools to select and import data 'responsibly' to destination platforms. Responsible importing means respecting privacy scope.

You will have the option of specifying JSON or HTML formats for data export. Google's JSON data is far more usable and useful than the HTML format, and is better supported by import tools.

The questions of want, can, may, and should refer to your preferences, abilities, permission, and risk exposure or resource limitations. Available export and import tools, copyright and other legal limitations or risks, privacy or appropriateness, and just general suitability, are among these considerations.

It may make sense to abandon some, much, or all of your data.

These are questions you and possibly your community must decide for yourselves.

The general process:

  1. Determine what data you want to, can, should, and may retain.
  2. Create your Google+ takeout. Select the JSON export format, NOT the HTML option.
  3. You probably want to include Posts, Comments, and Contacts, at a minimum, from Google+. You can include media such as photos, audio, and video if you like, or create a separate archive for them.
  4. Request the archive, and wait. Creation may take hours, possibly days. You will receive an email or notification when it is complete.
  5. If the archive fails or is incomplete, you will need to regenerate it. There are reports of many archival attempts failing. Google have been made aware of this; more feedback should help.


Storing Google+ data

Your archive represents both valuable and potentially harmful information. Loss, modification, or disclosure could all pose dangers.

  1. Store the archive in a safe place. This means somewhere where it cannot be accidentally deleted or corrupted and where others cannot gain unauthorised access to it.
  2. Your G+ takeout will almost certainly contain private data from or about you or others. Treat it like valuable and sensitive information. Again, there is a saying in security circles: Data are liability.
  3. With some irony, Google Drive is probably one of the better options for storage: easy, accessible, durable, and reasonably safe. The information is already on Google, so you’re not changing the risk calculus too much.
  4. You may (and probably want to) download the archive to your own computer (laptop or desktop). Be aware that storing it there carries a risk of damage, loss, or breach.
  5. One or more offline copies, saved to USB storage, a local NAS or archive system, or CD, DVD, or Blu-ray media, is another Good Practice. I recommend optical media. Keep in mind that at roughly 700 MB, CDs have limited storage relative to archives which may be 1 - 100 GB or larger. Blu-ray may be your best option here. Burn 2-3 sets and store securely in separate locations as protection against damage or loss.

That’s got you your archive.
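
If you keep several copies, it helps to be able to confirm later that each one is still identical to the original. One common way to do that, sketched here in Python, is to compare SHA-256 checksums. The function names sha256_of and copies_match are our own, and this is a suggestion rather than part of any Google-provided process.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def copies_match(original, copy):
    """True if both files have identical contents (by SHA-256)."""
    return sha256_of(original) == sha256_of(copy)
```

Recording the digest of each archive file when you first download it gives you something to check offline copies against in future, even if the original is gone.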


Extracting, classifying, and converting Google+ data

  • The jq (JSON query) utility can query, output, and process JSON archives. This is how you extract information from the archive.
  • Your post and comment data will appear in Google-markup format. This includes _italic_ and *bold* markup, as well as internal Google profile references. I don’t recall their format, but references may not be directly translatable.
  • A simple shell script (sed, awk, perl, python, ruby) can substitute HTML or Markdown tags within the content. I don’t think Pandoc directly recognises G+ markdown, but if it’s close enough to AsciiDoc, on which I think it’s based, that’s another option.
  • Pandoc can create HTML, or any of dozens of other formats, from Markdown (and possibly directly from the G+ tags). So that’s how you get HTML.
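
As a rough illustration of the tag substitution described above, here is a minimal Python sketch that turns the _italic_ and *bold* markers into HTML tags. It assumes only those two markers, in the simple form described here. Profile references and any other G+ markup are not handled, and the function name gplus_to_html is our own.

```python
import html
import re

# A marker counts only when it hugs the text it wraps and is not in the
# middle of a word, so snake_case_names and "2 * 3 * 4" are left alone.
BOLD = re.compile(r"(?<!\w)\*(\S(?:.*?\S)?)\*(?!\w)")
ITALIC = re.compile(r"(?<!\w)_(\S(?:.*?\S)?)_(?!\w)")


def gplus_to_html(text):
    """Escape HTML special characters, then convert *bold* and _italic_."""
    escaped = html.escape(text, quote=False)
    escaped = BOLD.sub(r"<b>\1</b>", escaped)
    escaped = ITALIC.sub(r"<i>\1</i>", escaped)
    return escaped


print(gplus_to_html("This is *bold* and _italic_ text"))
```

Escaping first means any literal angle brackets in a post are preserved as text rather than being interpreted as HTML on the destination platform.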

You’ve still got the problems of:

  • Identifying post context, date, author, thread, and privacy scope. Those are contained in the JSON formats, but it’s been years since I’ve looked at them.
  • Determining which data you do and which you do NOT want to make public. Because you could be violating original privacy scope and intent.

Those two issues mean that you should NOT simply blindly import and publish your Google+ archive on some new site or platform. You will want to review content. Skipping any non-public material as a first option is a Very Good Practice.
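
The "skip anything non-public" first pass can be sketched as below, in Python. Please note that the key name "visibility" and the value "public" are placeholders we have made up for illustration: they are NOT the real Takeout schema, which you will need to check against your own archive. The sketch also assumes one JSON file per post. The important design point is that a post is skipped unless it is explicitly marked public.

```python
import json
from pathlib import Path


def is_public(post):
    """Placeholder test. 'visibility' / 'public' are hypothetical names,
    not the actual Takeout fields. Replace with the real ones."""
    return post.get("visibility") == "public"


def select_public_posts(directory, predicate=is_public):
    """Return (filename, post) pairs for posts the predicate accepts.

    Anything unreadable as a JSON object, or not explicitly accepted,
    is left out: the default is to withhold, not to publish.
    """
    selected = []
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            post = json.load(handle)
        if isinstance(post, dict) and predicate(post):
            selected.append((path.name, post))
    return selected
```

Passing your own predicate function lets you swap in the real privacy fields once you have confirmed them. The selected posts still need human review before publication.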

That may still not be sufficient due to copyright or other considerations of possible criminal or civil liability, or simply annoying whomever the original content was written by or references. You will have to use judgement here. Again, this means that simply redirecting the entire archive is not a viable process.

For specifics on the data structure see Google Takeout Data Structure.


Import or publish to target platforms

Finally, import the data or publish it to your intended platforms.

Contact information should be in vCard format, supported by most email and contact-management systems. The contacts may not be particularly useful if they don't include email, phone, or other non-Google+ addresses.
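
To see how many of your exported contacts are actually reachable outside Google+, you could count the cards that carry an EMAIL or TEL property. The following Python sketch does this with plain text handling. It assumes a simple, well-formed vCard file and does not handle every corner of the format (folded lines, for instance, are simply ignored). The function names are our own.

```python
def split_vcards(text):
    """Split vCard text into a list of cards, each a list of property lines."""
    cards, current = [], None
    for line in text.splitlines():
        marker = line.strip().upper()
        if marker == "BEGIN:VCARD":
            current = []
        elif marker == "END:VCARD" and current is not None:
            cards.append(current)
            current = None
        elif current is not None:
            current.append(line)
    return cards


def has_contact_route(card_lines):
    """True if the card has a non-empty EMAIL or TEL property."""
    for line in card_lines:
        name, _, value = line.partition(":")
        # Drop parameters (";TYPE=...") and any group prefix ("item1.").
        prop = name.split(";")[0].split(".")[-1].strip().upper()
        if prop in {"EMAIL", "TEL"} and value.strip():
            return True
    return False
```

Cards for which has_contact_route returns False are the ones that will be of little use once Google+ profile links stop working.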

Publishing to your Exodus destination platform(s) will vary by platform and available tools. It's been suggested to Google that they work with major providers to facilitate this, including respecting privacy settings where appropriate. You are encouraged to provide similar feedback.


If you have problems

Google's data takeout does not always work as desired. Most crucially, the end goal is not downloading your data, but creating something useful from it. People don't want backups, they want restores. What your "restore" is may vary: porting G+ posts to a blog, photos to a new service, creating a locally-accessible data resource, or moving your Community to a new home, as examples.

(TODO: We plan on addressing specific use-cases for data takeout and migration.)

That process involves creating an archive, storing it securely, selecting and converting the portions you want to keep, and importing them to a new home. Failures may occur at any of these steps, and you'll likely want to refer to one (or more) of them in the event of a Google Data Takeout error. Whilst Google likely views its task as creating archives, in truth it should keep the end goal in mind: creating restores.

Issues you encounter will likely be associated with one of these areas, which you may indicate in any reporting:

  • Archive specification: where and how to request and specify an archive.
  • Archive creation: completeness, data and file formats, contents.
  • Transfer to storage: downloading or transferring to local storage, Google Drive, Dropbox, etc.
  • Storage: retaining the archive, avoiding loss, deletion, corruption, damage, or unauthorised access.
  • Filtering and conversion: selecting specific content for use elsewhere.
  • Import and publication: incorporating data in new public or private platforms.


How to Report Issues

Please try to formulate your issues in the form of Expected Behaviour and Actual Behaviour, and include a concise but detailed description of your issue. Make clear the issue is with the Google+ Data Takeout archive.

A suggested feedback template is:


Short title-like description of issue

Longer description of issue


Expected behaviour:

Concise result you expected.


Actual behaviour:

Concise result you actually encountered.


Steps to reproduce:

Step-by-step description of your actions

In general, useful feedback:

  • Describes what you expected to happen.
  • Describes what actually happened.
  • Describes what you did.
  • Does not try to explain why something occurred. Software moves in mysterious ways, and true causes are often not immediately apparent to the user (or the troubleshooter).
  • May describe other limitations or events. If you have personal, technical, or infrastructure limitations (visual impairment, slow and unreliable Internet, limited local storage space), which constrain your options, this information may be useful to a troubleshooter in solving your problem if not their system's behaviour.
  • May describe what your goal was. "I was doing X to try to accomplish Y when Z happened." This can be especially useful.

This last point helps avoid what is known as the XY Problem. This is ...

a communication problem encountered in help desk and similar situations in which the real issue ("X") of the person asking for help is obscured, because instead of asking directly about issue X, they ask how to solve a secondary issue ("Y") which they believe will allow them to resolve issue X. However, resolving issue Y often does not resolve issue X, or is a poor way to resolve it, and the obscuring of the real issue and the introduction of the potentially strange secondary issue can lead to the person trying to help having unnecessary difficulties in communication and offering poor solutions.

A surprising amount of tech support involves understanding what a user is actually trying to accomplish rather than what they said they want to do. You can short-cut this process by starting with stating your goals rather than your methods.

(The other surprising amount of tech support involves understanding what a computer system did rather than what you wanted it to do.)

((Much of the rest involves ensuring that the right plug is inserted the right way into the right socket, switches are turned on correctly, and vendors aren't lying to you, deliberately or otherwise.))


Where to report issues

Report your issues as Google Feedback and to the Google+ Mass Migration or Google+ Help communities.

Google's much maligned "Send Feedback" service does provide value. We have seen issues reported there being addressed over the 8 October - 14 December 2018 period, following the Google+ Sunset announcement and increased demands on Google Data Takeout. The tool is frustrating for users, and by reports, for Google's engineers as well, but it does feed into Google's internal systems and produces results, if slowly.

Reports being posted to the G+MM and G+H are being compiled and passed on to Google engineers. Member volunteers are helping to categorise and classify these, and feed the information forward.

At this point (December 2018) we are not requesting reports to the Plexodus subreddit, though that may be a future option.

Expect to see bugs and problems addressed with higher priority than features and capabilities. With limited time and resources, peak priority is on ensuring that basic data takeout functionality works. Providing additional features or new functionality is far less likely. These may also raise conflicts or concerns more apparent to Google than to users.


Troubleshooting

There are relatively few options to resolve issues. The standard recommendation is "file Feedback, re-attempt the Archive".

But to cover some basics:

  • Confirm that you are logged in as the appropriate Google+ user for the account(s) or Page(s) you are attempting to archive.
  • Confirm that you've selected the products you want to download.
  • Confirm that you've not selected unnecessary products.
  • Confirm that you've selected the appropriate data formats. In general, use JSON and do NOT request HTML formats, where these are presented.
  • Confirm that you've selected the appropriate file format. Generally ZIP is more suitable than tar.gz.
  • Confirm that you've selected an appropriate file chunk size. If you are having problems with downloads, selecting a smaller file chunk size should permit more flexibility.
  • Consider restarting your browser or rebooting your computer if memory or CPU resources are limited.

Otherwise:

  • If the archive itself is corrupted, but downloads successfully, file feedback and create a new archive.
  • If the archive is incomplete, file feedback and create a new archive.
  • If the download fails, re-try the download.
  • If the download fails repeatedly, determine if you have Internet connectivity or service issues (contact your ISP or mobile data provider), and consider smaller chunk sizes if possible. Check local wiring and networking equipment as well, including your computer's networking cable or WiFi connection and all connected / associated equipment.
  • If downloads continue to fail or you have a high-cost metered network connection, consider an alternative archive storage option: Google Drive, Dropbox, etc.

For other problems, file Feedback and report the problem to the Google+ Mass Migration and/or Google+ Help communities.
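
To tell a corrupted download from a sound one before filing Feedback, you can test a ZIP archive locally. This Python sketch uses the standard zipfile module. The function name check_zip is our own, and it assumes you chose the ZIP format rather than tar.gz. It detects damage to the file, not missing content: an archive can pass this check and still be incomplete.

```python
import zipfile


def check_zip(path):
    """Return (ok, detail) after testing every member's CRC checksum."""
    if not zipfile.is_zipfile(path):
        return False, "not a readable ZIP file"
    try:
        with zipfile.ZipFile(path) as archive:
            bad = archive.testzip()  # name of first damaged member, or None
            if bad is not None:
                return False, "first damaged member: " + bad
            count = len(archive.namelist())
            return True, "%d members passed the CRC check" % count
    except zipfile.BadZipFile as error:
        return False, "bad ZIP structure: %s" % error
```

If your takeout was split into several chunks, run the check on each file separately. Including the detail string in your Feedback report gives the troubleshooter something concrete to work with.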


Data Duplication

An emerging issue with Google Data Takeout is repeated storage of items, especially images. The result is absolutely enormous, and useless, data archives. They are too big to be stored, transferred, or utilised.

Even without duplication, images can be the major component of archives, comprising 90% or more of all data.

What we're seeing is numerous duplicate copies of precisely the same image being stored within archives. What could be a few hundred megabytes of data can turn into many gigabytes, or by some reports even a terabyte or more. That is thousands of times larger.

It's not possible to exclude images from the Takeout request, though you should NOT request "Photos", as this is the separate Google Photos product. That includes Google+ images uploaded early in the product's history (the split seems to have come in April 2016), but any G+ photos are already included in the stream archive. Leaving Photos out can reduce the total data storage and transfer by up to half, possibly more.

An alternative is to use the Google+ Exporter from Friends+Me, which downloads text only, though image URLs are included, and can be used to retrieve images independently. An image downloader is under development by the company.

As of 18 December 2018, data duplication is an unresolved but major issue with Google Data Takeout. Google have been made aware of this.
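
Until this is resolved, you can at least measure the duplication in an unpacked archive, and decide which copies to keep. The Python sketch below groups files by a hash of their contents, so identical images are found even when their names differ. The function names are our own, and nothing is deleted: the sketch only reports.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1024 * 1024):
    """SHA-256 hex digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def find_duplicates(root):
    """Map each content digest to the paths sharing it (two or more only)."""
    by_digest = defaultdict(list)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            by_digest[file_digest(path)].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}


def wasted_bytes(duplicates):
    """Bytes that would be saved by keeping one copy from each group."""
    return sum(
        os.path.getsize(paths[0]) * (len(paths) - 1)
        for paths in duplicates.values()
    )
```

A report of the wasted_bytes total, alongside the overall archive size, would also make a useful concrete figure to include in Feedback to Google.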

Further considerations

Data migration will take considerable time if you plan on making the data public. Those who've ... had the pleasure of going through related processes a few times know that it can take months for one individual at the level of a few hundred items. You may find it simpler to just abandon a large archive of > 1,000 articles or so.

It is not necessary to convert all content in advance, so the process can be completed post-migration.

Assessing the scope of this task should be part of your pre-migration planning phase.


Third Party Tools

Independent tools created by third parties are beginning to appear, be announced, or enter development. Use of these entails risk of exposure of your Google account, your data, and third parties' data. Please be aware of this.

As of 30 November 2018, status of all these tools should be considered beta/experimental/in development unless specifically otherwise noted.


A list of tools that would be useful includes:

  • Takeout.G+Streams.Posts -> Blogger
  • Takeout.G+Streams.Posts -> Atom
  • Takeout.G+Streams.Posts -> Wordpress
  • Takeout.G+Streams.Posts -> Reddit
  • Takeout.G+Streams.Posts -> Other platforms that have an import or post API
  • Takeout.G+Streams.Posts -> Static HTML as a better alternative to that provided by Google.
  • Takeout.G+Streams.Posts.html -> Extract <body> section to files.
  • Takeout.G+Streams.Circles -> Enhanced VCard/CSV with additional data via G+API.people.get
  • Takeout - fix the filenames to deal with UTF-8 characters

Suggested by Julian Bond.

