{"id":316,"date":"2010-07-02T06:33:22","date_gmt":"2010-07-02T06:33:22","guid":{"rendered":"http:\/\/www.dgen.net\/blog\/?p=316"},"modified":"2018-11-08T23:46:19","modified_gmt":"2018-11-08T23:46:19","slug":"data-is-not-binary","status":"publish","type":"post","link":"https:\/\/dgen.net\/0\/2010\/07\/02\/data-is-not-binary\/","title":{"rendered":"Data is not binary"},"content":{"rendered":"<p><strong>Science, data, internet, ontology, work and non-work themes converging &#8211; my post on <a href=\"http:\/\/radar.oreilly.com\/2010\/06\/data-is-not-binary.html\">O&#8217;Reilly Radar<\/a>, reposted below<\/strong><\/p>\n<h3 class=\"subhead\">Why open data requires credibility and transparency.<\/h3>\n<div class=\"entry-meta\"><span class=\"meta-sep\">by<\/span> <span class=\"author vcard\"><a class=\"url fn n\" title=\"View all posts by Gavin Starks\" href=\"http:\/\/radar.oreilly.com\/gavin\" rel=\"author\">Gavin Starks<\/a><\/span> | <span class=\"entry-date\">June 30, 2010<\/span><\/div>\n<div class=\"entry-utility social\"><\/div>\n<div class=\"entry-content\">\n<p><em>Guest blogger Gavin Starks is founder and CEO of <a href=\"http:\/\/www.amee.com\/\">AMEE<\/a>, a neutral aggregation platform designed to measure and track all the energy data in the world.<\/em><\/p>\n<p>The World Bank has stated that \u201c<a href=\"http:\/\/blogs.worldbank.org\/dmblog\/open-vs-public-data-the-big-difference\">data in document format is effectively useless<\/a>\u201d.<\/p>\n<p>However, \u201copen data\u201d is only the beginning of a journey. Simply applying the rules of open source software may help us take the first steps, but there are new categories of challenges to face.<\/p>\n<h3>Data needs to be computable (i.e., 
acted upon in context)<\/h3>\n<p>\u201cData\u201d is a much broader term than \u201ccode.\u201d The term embodies a range of dimensions: there is more than just the numbers at play, especially with scientific data.<\/p>\n<ul>\n<li>How was the data collected?<\/li>\n<li>How should the data be used?<\/li>\n<li>Are the models for processing the data valid?<\/li>\n<li>What assumptions exist, in words and equations?<\/li>\n<li>What is the significance of the assumptions?<\/li>\n<\/ul>\n<p>In an age when peer review is an anachronism, we are searching for new solutions for \u201cscientific content management\u201d. When <a href=\"http:\/\/en.wikipedia.org\/wiki\/Pascal%27s_Wager\">Pascal\u2019s Wager<\/a> is invoked, it is equally important to remember <a href=\"http:\/\/en.wikipedia.org\/wiki\/Godel#The_Incompleteness_Theorem\">G\u00f6del\u2019s incompleteness theorems<\/a> (in any consistent formal system expressive enough to describe arithmetic, some true statements cannot be proved within the system).<\/p>\n<p>Only eight percent of members of the Scientific Research Society agreed that \u201cpeer review works well as it is\u201d (Chubin and Hackett, 1990; p.192). Peer review has also been claimed to be \u201ca non-validated charade whose processes generate results little better than does chance.\u201d But in the same context: \u201cPeer review is central to the organization of modern science \u2026 why not apply scientific [and engineering] methods to the peer review process\u201d (Horrobin, 2001). The absence of URLs for those two pieces of research is indicative of one of the problems we are trying to solve.<\/p>\n<p>Peer review persists in its current form for historical reasons, but it now occupies a niche, because technology has opened up usage to a mass audience.<\/p>\n<h3>We must build tools that enable credible engagement<\/h3>\n<p>To illustrate our story: we are engaged with the very pressing and complex issue of climate change. 
At <a href=\"http:\/\/www.amee.org\/\">AMEE<\/a> we codify international, government, and proprietary data, models and methodologies that represent, at the most fundamental level, the algorithms that enable the energy, carbon and environmental cost of consumption and activities to be calculated. AMEE doesn\u2019t just store and re-broadcast data; it performs the calculations based on inputs to the models.<\/p>\n<p>One of our challenges is getting at the raw data in a useful, repeatable, and traceable form. As a result, one of the core services we offer to data and standards managers is a set of tools that makes this possible.<\/p>\n<p>Releasing raw data is vital. There can be no excuse not to. Releasing source code is optional. It\u2019s truly great for open source review, but it\u2019s also dangerous if everyone just re-runs the same code with the same baked-in implicit and explicit assumptions and errors.<\/p>\n<p>This is where data and code diverge substantially. The logic cascade for the interpretation of data is not unary (there is no single interpretation); it is based on assumptions that may vary and are subject to many quantitative and qualitative inputs: the interpretation of the data is not even binary.<\/p>\n<p>We believe it\u2019s much better to publish the following five components to provide transparent and auditable disclosure:<\/p>\n<ol>\n<li>The raw data<\/li>\n<li>The circumstances of its collection<\/li>\n<li>The method and assumptions used to process the data (in words and equations)<\/li>\n<li>The results of the processing<\/li>\n<li>The known limitations on the method and significance of the assumptions<\/li>\n<\/ol>\n<p>The processing code should be written from scratch as many times as possible to reduce the chance that any one implementation affected the results.<\/p>\n<p>Once \u201cpublished,\u201d the challenge is how to build out a credible and usable set of services that encourage correct usage.<\/p>\n<h3>Building the solution stack<\/h3>\n<p>At 
AMEE we have developed a six-tier solution to try to address some of these issues. Specifically, we address the gap between content creators\/managers (e.g. standards bodies) and content users (e.g. software apps, consultants, auditors), with a solution that is both human- and machine-readable.<\/p>\n<p><strong>1. Aggregation<\/strong> \u2014 We aggregate the raw data, and track and log the sources. We have a standards <a href=\"http:\/\/www.amee.com\/2010\/01\/28\/amee-carbon-standards-spider\/\">spider<\/a> that checks for changes, not unlike a search engine spider.<\/p>\n<p><strong>2. Content Enhancement<\/strong> \u2014 In the process of aggregation, we document the data, and embed provenance, linking back to the source. We also add <a href=\"http:\/\/explorer.amee.com\/Authority\">authority<\/a>, a measure of the reliability and credibility of the source. We\u2019re beginning to add other taxonomies and semantic links that enable the data to be joined, and are building tools for engagement with the platform to stimulate discussion.<\/p>\n<p><strong>3. Discoverability<\/strong> \u2014 <a href=\"http:\/\/explorer.amee.com\/\">AMEE Explorer<\/a> is the human-readable version of the data, and the only search engine on carbon calculation models (N.B.: we are focused on the industrial and human impacts at the moment, not modeling the climate itself).<\/p>\n<p><strong>4. Repeatable Quality<\/strong> \u2014 We have a quality-control process around the underlying data that is similar to a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Six_Sigma\">Six Sigma<\/a> process. Our systems self-test the data every 30 minutes, and human checks are carried out at random intervals to ensure systemic errors have not been introduced. Our target accuracy metric is 100 percent, not five-nines.<\/p>\n<p><strong>5. 
Computable Engine<\/strong> \u2014 We believe we are taking the notion of a master database service to an entirely new level by ensuring that not only is the data robust, but AMEE also performs the actual calculations. AMEE retains an audit history behind both the inputs and the calculations themselves.<\/p>\n<p><strong>6. Interoperability and auditability<\/strong> \u2014 The AMEE API is the machine-readable version of the data (in fact all of the content including metadata and documentation), which enables the calculations to be done. AMEE also stores the audit history of both the inputs and the calculation mechanics. For example: PUT (a flight in an F-15 from London to New York at combat thrust), and GET the kgCO2 for that journey, or PUT (1000kWh reported by my Whirlpool fridge for this month, in Washington, using my preferred energy supplier and my solar panels) and GET the kgCO2.<\/p>\n<h3>Challenges<\/h3>\n<p>AMEE is positioned right at the junction between cloud, code, API, content, data, and the usage of the data, and as carbon becomes priced, we believe the consequences of getting it wrong are extremely high.<\/p>\n<p>From an \u201copen\u201d standpoint, one of the big challenges we face is defining where the boundaries of \u201copen\u201d lie. Our value, of course, is in the ongoing maintenance and reliability of the system, and connecting the data.<\/p>\n<p>Commercially, we are treading very carefully through the platform and use-case stack (core platform, API, data, algorithms, code, structure, etc.), and increasing transparency at the most relevant points for the end-user (who needs to feel confident about their own inputs and outputs). 
It\u2019s a complex stack, and no open source or Creative Commons licenses wholly cover the kinds of issues we face.<\/p>\n<p>Our field, carbon footprinting, is what we call a \u201cnon-trivial\u201d example of where open data meets the markets: billions of dollars are flowing through or around these data on the carbon markets. For example, thousands of businesses in the UK have to start reporting their carbon footprint to the government this year, and paying for it next year. Very, very few people understand how to use this data, how it all joins together, where the trap doors are, and why it\u2019s important to build an industry-stack to solve the problem.<\/p>\n<p>If we don\u2019t build a credible industry stack, from the ground up, the outcome could be no industry at all (or a tiny one), and that has dire consequences not only for the vendors and businesses in the space (such as SAP, SAS, CA, Microsoft, Google, and others), but also removes our ability to accelerate solving the underlying issue of carbon and climate change itself. The root cause of this credibility gap has been a lack of transparency, and no one has comprehensively joined the dots to see what is real, and what is not.<\/p>\n<p>We also believe this kind of approach has huge value in many areas beyond the ones AMEE is addressing.<\/p>\n<p>Open data isn\u2019t just about re-broadcasting data, but combining it, re-using it and building upon it. It\u2019s about creating new uses, creating new markets and building credibility into the data as it flows.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Science, data, internet, ontology, work and non-work themes converging &#8211; my post on O&#8217;Reilly Radar, reposted below Why open data requires credibility and transparency. 
by [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1218,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[5,16,21,8,20],"tags":[],"class_list":["post-316","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-business","category-climate","category-energy","category-science","category-social-change"],"jetpack_featured_media_url":"https:\/\/dgen.net\/0\/wp-content\/uploads\/2010\/07\/dataNotBinary.png","jetpack_shortlink":"https:\/\/wp.me\/pfJFK3-56","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/dgen.net\/0\/wp-json\/wp\/v2\/posts\/316","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dgen.net\/0\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dgen.net\/0\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dgen.net\/0\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dgen.net\/0\/wp-json\/wp\/v2\/comments?post=316"}],"version-history":[{"count":5,"href":"https:\/\/dgen.net\/0\/wp-json\/wp\/v2\/posts\/316\/revisions"}],"predecessor-version":[{"id":2396,"href":"https:\/\/dgen.net\/0\/wp-json\/wp\/v2\/posts\/316\/revisions\/2396"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dgen.net\/0\/wp-json\/wp\/v2\/media\/1218"}],"wp:attachment":[{"href":"https:\/\/dgen.net\/0\/wp-json\/wp\/v2\/media?parent=316"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dgen.net\/0\/wp-json\/wp\/v2\/categories?post=316"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dgen.net\/0\/wp-json\/wp\/v2\/tags?post=316"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}