Introduction to DataPortability and the state of its standards
Since November 2007, the DataPortability discussion has been big news. The following post is an overview of the underlying ideas of "data portability" as well as the status of the proposed technologies which will enable it.
What is Data Portability?
The idea behind "data portability" is that website visitors should be able to control the data that a site has about them. It's a philosophical ideal that harkens back to the early days of the Internet, before big businesses got into the game, when the Internet was thought to be a place of independence and freedom. Today, Gmail has all my email, Facebook has my friends, Flickr my photos, Amazon my book and music purchases, and so on. DataPortability is a workgroup that is urging these web companies to let their users export all of their data and import it into another site at will. Over the past month, representatives from big players like Microsoft, Google, and Facebook have joined DataPortability to help create standards for data transmission.
What will I hopefully be able to do with Data Portability?
Though you can't do anything yet since "data portability" is just an idea right now, here are two possible scenarios:
Transferring friends and interests between sites. Facebook has data on who my "friends" are and information about things they like; Amazon has information on books and music I like based on what I've purchased and clicked on over the years; Pandora has information on what music I like, and so on. If one of my musician friends joined another social network like Bebo and I wanted to follow him, "data portability" would allow me to compile my interests and friends from Facebook, Amazon, and Pandora together and import that data into Bebo so I wouldn't have to start over there. It would already know what I like and who my friends are.
Collating my online activities. Much of my life has either "happened" on the web (blog posts, Facebook messages, Flickr photos, etc.) or left a record of real world events on the Internet. It would be good for me to be able to get all of this data and store it for my own use. Today I can't do this, but if all those sites were enabled with "data portability," I could theoretically pull out my photos from Flickr, tickets from American Airlines, blog posts, emails, reviews at TripAdvisor, etc. and put together a timeline of my last vacation.
What possible problems are there with Data Portability?
"Data portability" sounds great in theory, but there are some problems that have to be overcome before it can really happen. These "problems" are not deal-breakers, just issues that the DataPortability workgroup is attempting to solve, and things users ought to think through before they adopt such technology.
Identity verification. Probably the biggest technical hurdle is in the area of "identity management." If your list of friends has more than one person named "John Smith" or a friend with just a screen name like "JohnMayerLover556," then it will be difficult to import these into another site, especially if that site has 10 more users named "John Smith." Websites usually assign a number to keep track of each user (a unique ID), but these cannot be shared among sites since the same number can refer to different people on different sites. Some standards propose using an email address as a unique ID, but people generally don't want their email address openly shared because of spam. One proposed standard (FOAF, see below) proposes hashing the email address, but currently many sites (such as Facebook) don't allow email addresses to be exported. Also, since this information is being exported from a site, it becomes more public. Friendships will therefore need to be verified in both directions, since otherwise someone could merely copy your unique ID into their friends list and claim a relationship.
Business value. There is some question as to why big websites would want to do "data portability," even if they've already joined the DataPortability workgroup. Most introductions to "data portability" use Amazon as an example, but interestingly Amazon has not yet joined. They have worked hard to build a great store, and they have collected a lot of data over a long period of time that they are not likely to want to give up to every new site that pops up. Similarly, Facebook's targeted advertisement platform is built on the interests that users enter into the site. This is an important part of their business model, and not something they are likely to give up unless there is a good business reason to do so. I don't think that Google, Microsoft, and Facebook joined DataPortability merely out of goodwill and adherence to philosophical ideals of freedom, but because they believe they can make money doing so.
Security. As with "identity management," the overall security of publicly posting a lot of data about oneself could pose problems. For example, if I post all my interests, all my website associations, and all my personal relationships in accessible XML files, then identity theft could be a genuine possibility. If someone can do damage with just a name, address, and social security number, imagine what they could do with data that allows them to mimic your lifestyle.
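To make the "verified in both directions" point from the identity verification item concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `id:` strings, the `exports` structure, and the `confirmed_friends` function are made up for illustration and are not part of any DataPortability proposal. The idea is simply that a friendship only counts if each person's exported list names the other.

```python
from typing import Dict, FrozenSet, Set


def confirmed_friends(exports: Dict[str, Set[str]]) -> Set[FrozenSet[str]]:
    """Return the pairs of IDs that list each other as friends.

    `exports` maps a (hypothetical) unique ID to the set of IDs that
    person's exported friend list claims.
    """
    confirmed = set()
    for person, friends in exports.items():
        for friend in friends:
            if friend == person:
                continue  # ignore anyone listing themselves
            # Accept the claim only if the friend's own export lists this person.
            if person in exports.get(friend, set()):
                confirmed.add(frozenset((person, friend)))
    return confirmed


# Made-up data: Mallory has copied Alice's ID into her own friend list.
exports = {
    "id:alice": {"id:bob", "id:carol"},
    "id:bob": {"id:alice"},
    "id:carol": set(),
    "id:mallory": {"id:alice"},
}

mutual = confirmed_friends(exports)
```

With this data, only the Alice and Bob pair is confirmed. Mallory's one-sided claim is dropped, and so is Alice's claim on Carol, since Carol never listed her back.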
What technologies will hopefully enable Data Portability?
So "data portability" has awesome potential and some problems to work through. Another issue is what "standards" should be used to transfer this data. Long before the DataPortability workgroup was formed, several formats were proposed to handle this kind of information. Here is a summary of four of the most important ones:
- OpenID – Instead of having separate usernames and passwords for each site you visit, OpenID allows you to have one login across the Internet that you control through a URL. For example, I could have openid.johndyer.name be my OpenID. If I go to a site that supports OpenID, I would not enter a username and password at that site, but just my OpenID (openid.johndyer.name). The site would send me to openid.johndyer.name, where I would log in and then verify that I wanted to return to the sending site. This way, I control the username and password and I can use it anywhere.
- SIOC (Semantically-Interlinked Online Communities) – SIOC's goal is to connect websites and content on those sites together. For example, it should be able to help identify blog posts, Amazon book reviews, blog comments, and forum posts made by the same person. Also, it should be able to semantically connect websites and blogs based on their tags and content. Right now, plugins exist for popular web applications (such as WordPress and phpBB), but there are very few applications which can actually do much with the data.
- FOAF (Friend of a Friend) – While SIOC is about websites and content, FOAF is more about people and relationships. The idea is to be able to represent "friend" lists in a standard way. Obviously this introduces security questions (do I really want my friends on Facebook importing my data into Xanga?), but some have already been addressed. FOAF proposes using either OpenID or a hashed email as a unique ID. In addition to relationships, FOAF can also describe activities people do through calendars, photos, blogs, etc. (working example: Facebook FOAF exporter app)
- APML (Attention Profile Markup Language) – APML is the proposed standard for cataloging "interest." The idea is to take all of the things you do on the web (uploading photos, blog posting, music and book purchases, etc.) and organize the related subjects and topics into one standardized document of your interests. If you found a new music store or website, and you wanted to see if it had things you like, you could upload your APML document, and it would filter its data to only show things of interest to you. (see: http://www.engagd.com/ for a working example)
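As a rough illustration of the hashed email idea mentioned under FOAF: the FOAF vocabulary has a property called mbox_sha1sum, which is the SHA-1 hash of the person's "mailto:" address. The sketch below uses Python's standard hashlib. The function name and the example addresses are made up, and trimming and lowercasing the address first is my own assumption so that two sites are more likely to produce the same value; the FOAF vocabulary itself does not require it.

```python
import hashlib


def mbox_sha1sum(email: str) -> str:
    """Hex SHA-1 digest of the mailto: URI for an email address."""
    # Assumption: normalize the address so "A@Example.com" and
    # "a@example.com" hash to the same value.
    normalized = email.strip().lower()
    uri = "mailto:" + normalized
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


# Hypothetical addresses, for illustration only.
alice_id = mbox_sha1sum("alice@example.com")
same_alice = mbox_sha1sum("  Alice@Example.com ")
bob_id = mbox_sha1sum("bob@example.com")
```

Two sites that both know my address can compute the same 40-character value and match my accounts without the address itself appearing in the exported file. It is not a secret, though: anyone who already has the address can compute the same hash.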
There are also additional proposals for relationships (XFN), personal details (hCard), and identity management (Yadis) which have some overlap with the standards above. Also, OPML is a standard for collecting lists such as RSS feeds. Here is the full list of proposed standards.
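Of the formats above, the feed list format (OPML) is probably the easiest to try out today, since many feed readers can already export one. An OPML file is XML made of nested outline elements, and feed entries conventionally carry the feed address in an xmlUrl attribute. Here is a small sketch that pulls the feed URLs out of such a file using Python's standard library. The sample document and the `feed_urls` function are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up subscription list; real exports have the same general shape.
SAMPLE_OPML = """<opml version="2.0">
  <head><title>Example subscriptions</title></head>
  <body>
    <outline text="Tech">
      <outline type="rss" text="Example Blog" xmlUrl="http://example.com/feed.xml"/>
    </outline>
    <outline type="rss" text="Another Feed" xmlUrl="http://example.org/rss"/>
  </body>
</opml>"""


def feed_urls(opml_text: str) -> list:
    """Return every xmlUrl found on an outline element, in document order."""
    root = ET.fromstring(opml_text)
    # iter() walks nested outlines too, so folders are handled for free.
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]


urls = feed_urls(SAMPLE_OPML)
```

The "Tech" folder has no xmlUrl, so it is skipped, and `urls` ends up holding the two feed addresses. Importing that list into another reader is the kind of small, boring portability that already works.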
What does it all mean and what can I do today?
Since there are very few real world uses of these technologies, "data portability" is only an idea at this point. Right now, there are no sites listed as "DP Enabled." This is mostly because "DP" hasn't been fully defined yet, which shows us how early in the game this all is. OpenID perhaps has the greatest number of sites which support it, though some sites are merely offering their own login as an OpenID (for example, Yahoo offers openid.yahoo.com/youryahooid as an OpenID URL). APML, FOAF, and SIOC are currently not supported by any major sites, and they may not be for some time unless a unique ID can be reliably used. Also, it seems that the current standards can't really handle the entire "connect, control, share, remix" vision, so a lot of work will probably need to be done. Hopefully it will go more smoothly than HTML5 and ES4 development.
Interestingly, the idea behind "data portability" is a lot like the early 90s, before the Internet was widely available, when people were on closed networks like AOL and CompuServe. They could only send email to people on the same network, but when Netscape came on the scene, access became universal, email and HTML became standard, and it changed everything. Today users are on different networks (MySpace, Facebook, Xanga, etc.) with no way to connect. Hopefully, "data portability" will be able to change all that.