Data Lifecycle

Meeting Notes

PEG 2013 Jul 29

Dilbert strip read at the beginning of the ‘Liberate Your Data’ presentation on 2013 July 17 at the Open Group Conference in Philadelphia.  I found a copy of the strip in my office break room – complete with a coffee stain.  It’s from Build a Better Life By Stealing Office Supplies, Dogbert’s Big Book of Business.  The title is “Analysis as a tool to avoid decisions”.  1st frame: a long-bearded Dogbert sagely declaims “The purpose of analysis is to avoid making hard decisions.  Therefore, there can never be too much analysis.”  2nd frame: the pointy-haired boss is holding a document while seated at his desk and Dilbert is seated across from him holding a folder.  The boss queries “Did you do a present value analysis?” and Dilbert replies “Yes.”  3rd frame: boss asks “Environmental study?”, Dilbert replies “Yes”, boss “Budget analysis?”, Dilbert “Yes”, boss “Stockholder impact?”, Dilbert “Yes”, boss “Carbon dating?”, and Dilbert “Uh… no”.  4th frame: the boss tosses the document in the air with a look of disgust and states “Well, then you’re wasting my time, aren’t you.”

JFB 2013 Jul 18

Very pertinent 2013 Jul 18 Dilbert cartoon.  1st frame: pointy-haired boss says to Dogbert “We’ve been using the Dogbert Offsite Document Service for five years, and frankly, I’m concerned.”  2nd frame: a shot of boxes at the loading dock and a waiting green truck and the overheard statement “Your service trucks look suspiciously like garbage trucks.”  3rd frame: pointy-haired boss says “I would cancel your service if I could find the contract” to which Dogbert replies “It’s in ‘storage’.”

PEG and JFB 2013 May 27

CV and data declutter: though we did not have the words for it when we developed CV, part of our motivation was reducing the data clutter that causes data overwhelm around CSR considerations.  CV drastically reduces CSR data clutter, which enables organizations to effectively and efficiently include all three types of sustainability considerations in their decisions.  The CV repository further reduces CSR data clutter by providing a single authoritative source for LEI calculations and reports.  (Another motivation was improving decision outcomes – an indirect result of CSR data decluttering.)

Refer to the ‘Questions’ section of the Data Clutter page for the other items discussed.

PEG and JFB 2013 Apr 26

Semantic world – taxonomy, metadata.

Data governance:

  • retention
  • ownership
    • access and edit control
    • creation of derivatives; how much can you change, incorporating changes in derivatives into the original
  • approval
  • templates
  • tagging, taxonomy
  • criteria for making decisions about data
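
A minimal sketch of how the governance attributes above might be captured as record-level metadata.  All field names, and the retention check, are assumptions for illustration, not anything we settled on:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional, Set

@dataclass
class GovernedItem:
    owner: str                                      # ownership
    approved_by: Optional[str] = None               # approval
    retention_until: Optional[date] = None          # retention
    tags: Set[str] = field(default_factory=set)     # tagging / taxonomy
    editors: Set[str] = field(default_factory=set)  # access and edit control
    derived_from: Optional[str] = None              # provenance of derivatives

    def retention_expired(self, today: date) -> bool:
        # One concrete 'criterion for making decisions about data'.
        return self.retention_until is not None and today > self.retention_until

item = GovernedItem("PEG", approved_by="JFB",
                    retention_until=date(2014, 4, 26), tags={"csr", "lei"})
print(item.retention_expired(date(2013, 4, 26)))  # -> False
```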

Being pocket picky – see the Let Go of Clutter comments; in a data context it means knowing the defining attributes of the data that you’ll use in making decisions.

Tool to discover data rules – which get fed into governance.  Currency of data – and of metrics.

Slips of paper with taxes.

PEG and JFB 2013 Apr 19

Biblical parable related to data hoarding: Luke 12:16–21.

‘Data Freshness Date’

Hoard due to trauma or loss.

US Census data – is long-form completion truly representative?  The short form is an attempt to get everyone to complete it rather than surveying a representative sample.

JFB example of bank reports.

Get rid of emails and files that you could get sued over.

Give people time and incentives to harvest and a way to manage that content.

RCCO (Rational ClearCase Online)… a la presentation content management.

Knowledge management – survive brain drain from retirements and other departures of experienced employees.  [A form of key man insurance.]  Don’t want to just squeeze experienced employees and then replace with cheap, inexperienced, offshore workers: want the experience to keep growing your knowledge base.

Quality control.  Risk reduction.  Service packaging – even if only internal.

PEG and JFB 2013 Apr 05

Rational ClearCase with PowerPoint idea: refer to the Clutter Buster Inventions page.

KQED Radio aired a Marketplace segment at 6:30pm on 2013 April 4 titled “Big Data, Big Storage Bills” (link includes audio and transcript).  (“Big data is the slogan of modern-day marketing. But as companies collect more and more information on their customers, the cost of storing this data is soaring to the tune of $70 billion a year.”)  One data point in the segment was that 25% of IT budgets goes to storage.

How to get around the linearity of PowerPoint (vs. Prezi)?

Put tags in a comment or maybe in speaker’s notes.  Both content and format tags.

Don’t need to save new revisions of presentations but instead build instructions [a ‘recipe’ and the necessary ‘ingredients’ to ‘cook’ a presentation on demand].

Add-on database includes rules for the slide masters.

PowerPoint – discrete component manufacturing.  Add-on assembly.  Use ppt macros?  Import ppt into an assembly engine and database.  Auto-tagging.  Store content used in external tools (an XYZ chart, for example) and all resources needed to maintain / modify that content.  Event processing to implement rules to make decisions.
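
A minimal sketch of the tag-and-assemble idea using the python-pptx library.  The tag convention in the speaker’s notes (a line such as “tags: intro, lei”) and the file name are assumptions, not a worked-out design:

```python
from pptx import Presentation

def slide_tags(slide):
    """Extract the tag set from a slide's speaker notes, if any."""
    if not slide.has_notes_slide:
        return set()
    for line in slide.notes_slide.notes_text_frame.text.splitlines():
        if line.lower().startswith("tags:"):
            return {t.strip().lower() for t in line[5:].split(",")}
    return set()

def build_recipe(library_path, wanted):
    """Return the indices of library slides matching any wanted tag --
    the 'recipe' from which a deck could be assembled on demand."""
    prs = Presentation(library_path)
    wanted = {t.lower() for t in wanted}
    return [i for i, s in enumerate(prs.slides) if slide_tags(s) & wanted]

print(build_recipe("slide_library.pptx", ["intro", "lei"]))
```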

PEG 2013 Apr 05

(Refer to the Liberate Your Data page for the abstract accepted for the Open Group conference in Philadelphia 2013 July 17.)

PEG and JFB 2013 Mar 15

Better personal inventory systems – e.g. people who only own 100 things; e.g. people in micro apartments.

Measure how many things you own, how many you use, frequency of use, and probability of use.  Evaluate the cost of storage and the cost of disposal versus the cost of replacement / rental / borrowing, weighted by the probability of need and the probability that the item (or a reasonable substitute) won’t be available.
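
A minimal sketch of that keep-versus-toss arithmetic; all numbers are illustrative assumptions:

```python
def keep_or_toss(storage_cost, disposal_cost, replacement_cost,
                 p_need, p_unavailable):
    """Compare the yearly storage cost against the cost of tossing:
    disposal plus the expected cost of replacement, which is incurred
    only if the item is needed AND no substitute is available."""
    expected_toss_cost = disposal_cost + replacement_cost * p_need * p_unavailable
    return "toss" if expected_toss_cost < storage_cost else "keep"

# e.g. $12/yr to store, $1 to dispose, $40 to replace, 10% chance of
# needing it, 50% chance no reasonable substitute would be available
print(keep_or_toss(12.0, 1.0, 40.0, 0.10, 0.50))  # -> "toss" (3.0 < 12.0)
```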

Reuse sustainability.

We keep stuff “just in case” or “in case someone needs it”.  Renting instead of owning.  Borrowing.

Data reduction versus organization: for example, mailing links instead of attaching material.

Designating authoritative sources not only reduces clutter but can also improve accessibility and coherency.

PEG and JFB 2013 Feb 04

Cost-benefit analysis.  Forcing conference call host to dial-in.  Have to dial correct local access number.

PEG 2013 Jan 22

(Refer to the ‘Other Abstracts’ section of the Liberate Your Data page for an older abstract submitted on the topic of data clutter.)

PEG and JFB 2012 Nov

The American Community Survey has replaced the long form census in the United States.

Facebook’s marginal cost of archiving data vs. the revenue derived from that data.  Leads to questions of too much data and the risks of storing data.  On The Media segments on Facebook which aired 2012 October 26: Facebook’s collection of data from people who visit Facebook sites (not just those with Facebook accounts), “Life in Facebookistan”, and “That Little Thing Called ‘Like’”.

Polling in the US – see Nate Silver’s blog on somewhat artificial convergence of poll results since pollsters don’t want to be the outliers.  On a different point Nate has a database that goes back to the ’80s.

Nate Silver 2012 Oct 5 interview on “The Pitfalls of Prediction” on On The Media

http://fivethirtyeight.blogs.nytimes.com/
http://fivethirtyeight.blogs.nytimes.com/2012/10/27/oct-26-state-poll-averages-usually-call-election-right/

Montreal bike share program doesn’t provide historical data but an external party created an app to scrape current data from the bike share site every 15 minutes.  The scraped data is stored and shared.

Insurance companies use shared actuarial tables but each have their unique algorithms to make decisions based on that data.  Maybe election prognosticators should share more data.

Average wage is higher in the US than in Canada but the median wage is about the same.

JFB 2012 November 12

A way we could work on this topic…

Structure:

– Establish the waste that extraneous data brings on (in terms of GHG / expenses / process and person-hours of work, but also poor decision-making)

– There is a shift from the past: (1) first we were collecting little data and analysing it ad hoc; then (2) we realised we could collect a ton of it cheaply and decided we’d store it until we analysed it later.  An example is unlimited Gmail data, etc.  This turned out to be a bad idea since we collect so much data we don’t know what to do with it (example: smart meters), so we have to move to paradigm (3) where we stream data and decide what to keep and what to toss out on the fly, even adding best-before dates, deciding whether new information supersedes other information, or whether metadata is better to keep than individual sources – i.e. keep the 538 analysis and eventually toss the individual detail… not sure if this is a good example though.

– If there is a cost to doing things the way we are doing them, and there is a shift in paradigm on data management because of it, we can argue not about what is happening, but rather about how that shift should be done.  We could argue that (a) a pure cost analysis is too narrow, and (b) an approach that relies solely on the potential utility of data might be too broad (i.e. that is what got us into trouble in the first place).  What we have to do is figure out how individual data pieces contribute to improvement of life expectancy and determine their value based on that potential contribution.

JFB and PEG 2012 November 12

Data Lifecycle: decide what data is needed and why; collect the data; filter; extract value; store; and re-purpose.  The ‘store’ step includes decisions about what, where, how long, and how accessible.  The ‘re-purpose’ step includes reusing and recycling the data; for example, toss a data point (?), hand off data set to a repository to be used in a different way and/or in combination with other data sets submitted (e.g. actuarial tables), and use the data in analysis to derive insights (i.e. produce more data).
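
A minimal sketch of those stages as a pipeline.  The stage behaviors and the best-before retention rule are illustrative assumptions, not anything specified in these notes:

```python
from datetime import date, timedelta

def collect():
    # Decide what data is needed and why, then gather only that.
    return [{"value": 42, "collected": date.today()}]

def filter_records(records):
    # Filter on the fly rather than archiving everything for later.
    return [r for r in records if r["value"] is not None]

def extract_value(records):
    # Analytics: turn raw records into decision input.
    return sum(r["value"] for r in records)

def store(records, shelf_life_days=365):
    # 'Store' decisions: what, where, how long, how accessible.
    for r in records:
        r["best_before"] = r["collected"] + timedelta(days=shelf_life_days)
    return records

def repurpose(records):
    # Reuse/recycle: derive insights, i.e. produce more data.
    return {"summary": extract_value(records), "sources": len(records)}

print(repurpose(store(filter_records(collect()))))
```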

  • What data do we need, and what data do we need to keep?
  • Waste from a process – a la Lean Sigma
  • Case study: Smart meters in Montreal (?).  The city’s modus operandi was to collect data for later analysis.  They waited until the end of the annual cycle but then found that they had way too much data.  So the city started to collect data more intelligently.  Note that the problem was addressed by reducing amount of data collected rather than filtering the data after collection.  The city also changed the archive parameters.
  • US libraries’ very short-term data storage to avoid having readers’ checkout histories subpoenaed by an investigative unit such as the FBI.  (This was a response to events in the wake of the passage of the Patriot Act.)
  • Nixon’s tapes; Bush and Cheney’s ‘disappearing’ emails.
  • Orwellian scenarios.
  • HUAC’s (Harry Belafonte example) and FBI’s (John Lennon example) over-collection of data.  Share the data collected with the public in a more timely fashion to increase the value of that data.
  • Declassification of data – US Federal Government in the 1990’s; IBM (US) in late 1990’s (?) – including revised guidelines as to what data should be kept and for how long; clarification of the use of the ‘IBM Confidential’ and ‘IBM Internal Use Only’ classifications; NDA requirements; Type I and II deliverables in service contracts.
  • What needs to be on paper vs. electronic.
  • If data is sorted into a folder, am I ever going to read it?  At what point should I unsubscribe?  When should I buy books and when should I use the library or a private sharing mechanism?
  • Green Dimes blocks a lot of junk mail, the Do Not Call list in the US blocks a lot of junk phone calls, and spam filters block a lot of email.
  • Information hoarders have stacks of magazines and newspapers to sort through.  We should unsubscribe from emails, magazines, and newspapers that we don’t get around to reading in a modest period of time.
  • The visible information hoarding may have gone down, but often that’s just due to a shift from print to electronic subscriptions.  We’re more likely to subscribe to electronic media.  The end result is that we end up with even more data and so are even less likely to get value out of it.
  • Want more data?  Make a rule that you have to toss some data before acquiring more.
  • Ostensibly tools help you handle more data, but having those tools doesn’t necessarily mean that you’ll have higher value data and/or extract more value from that data…  Potential and actual (kinetic) value of data.
  • The challenges of getting notes translated into a useful form and in a timely fashion.
  • Collaboration tools enable users to take data and put it into a different context and/or to present it to a different audience.

 

PEG 2012 November 14

In a discussion this morning about US General Allen’s email exchange with Jill Kelley on the KQED Radio Forum show the observation was made that in this era there’s too much information available – 30,000 pages of emails in this case.

In the 2012 November 12 print edition of the Christian Science Monitor, Chris Gaylord writes in the “Good to Know” column about Wolfram Alpha’s new abilities to mine your Facebook activity (and the information your Facebook friends make publicly available on Facebook).  He provides a graphic of his ‘Friend Cloud’ created by Wolfram Alpha; Wolfram Alpha predicts how his friends are connected to him and to each other based on their Facebook information and activities.  I’m very uncomfortable with this sort of Facebook mining tool: I can see lots of unpleasant uses (a la Hoover and McCarthy) but no positive uses.

The “When no news is bad news” cover story of the 2012 November 12 print edition of the Christian Science Monitor talks about the damaging consequences of not enough data.  In this case ‘not enough data’ means that too many communities lack timely publication of the results of local investigative journalism in media readily available to a broad cross-section of those communities.  The decrease in access to these results is correlated to declines in civic engagement and to civic malfeasance.  The article’s concluding quote is “…does a community cease to be relevant when there’s no newspaper?”.

The following quote is from the first edition (copyright 2012) of Colin Powell’s book It Worked for Me.  “Google’s corporate mission statement is identical with its purpose: “to organize the world’s information and make it universally accessible and useful.”  The founders set out to serve society, and created a remarkably successful company.”

PEG 2012 November 15

To the Best of Our Knowledge aired a program on 2012 November 4 about Memory and Forgetting.  One point made was that people who don’t forget can be crippled.  In a classic case a guy who wanted to be a journalist ended up dropping out of the workforce essentially because he couldn’t distinguish between the important and unimportant details – he didn’t forget what was unimportant.  The blurb on this episode of the show from the KQED Radio site: “To the Best of Our Knowledge Memory and Forgetting How clearly do you remember your last vacation? Or what you did last weekend? Do you think your memory is an exact record of what actually happened? Chances are, it’s not. Scientists have found that with every act of remembering our brains produce new neural circuits, creating new memories. The show talks with two Nobel laureates about the science of memory.”

  • Hoarding – for example, Scots and US Depression Era folks.  (My paternal grandfather, who was both Scottish and a child of the Depression (and came from a farming family), saved an array of parts and construction supplies just in case.  We found that he was making fuel for the wood stove warming the front parlor by rolling used newspapers and soaking them in used motor oil – do not try this at home!  The tendencies may be genetic in that both his sons have exhibited hoarding behaviors.  My brother might if he were more mechanically inclined; my sister-in-law balances him out by tossing all the potential fix-it projects.)
  • Sharing Communities.  Public (and private) libraries.  Regifting.
  • Can you truly have a light footprint if you have a lot of stored data (excluding knowledge in your brain) regardless of your physical, tangible possessions?
  • Data promiscuity

 

PEG 2012 November 18

“Are You an Oversharer Online? How to Tell” is a 2012 March 26 response (on Inc.) to an NPR story on a job seeker who had to give his prospective employer his Facebook username and password in order to get hired.  The post is mostly a review of the author’s trial of Secure.me, a reputation protection tool.  The blogger and the tool make some good points, but IMHO they don’t address the biggest problems of oversharing: too much information to adequately control or to sustainably store.  One commenter recommends maintaining a sufficient online presence to claim certain accounts in order to preserve one’s identity from false claimants.

Naked Data

The 2012 November 16 On the Media segment on the latest twist on revenge porn sites includes an interview with the new site’s owner.  He envisions a future in which everyone is pictured naked with the rest of their identifying data on the Web.  IMHO that’s TMI to the max.

PEG 2012 November 19

This morning’s (10am PST) Forum program on KQED Radio was about tiny apartments.  One of the guests made a good point about a person’s stuff expanding to fit their living space.  One of the online commentators said that living in a small (200 square feet) apartment caused her to lose weight – both because she was getting out more and because a smaller space called for a smaller person.  I also liked the guests’ suggestions that small apartments be designed with built-in lighting and storage (to optimize floor and wall space), and that a person moving into a small space keep only his/her most important collection, paring it down such that it can grow by 10% in the new space.  I think that this tiny apartment scenario offers good analogies for more sustainable data ‘living space’.

JFB and PEG 2012 November 26

In the context of the data lifecycle, Jevons paradox could be interpreted to mean that technological improvements in data consumption efficiency (via improvements in data indexing and searching, data access devices (e.g. the iPad), and data distribution) have been more than offset by an increase in data consumption.  (“Raw” data isn’t the fuel being consumed; processed (refined) data is the fuel – data in context, data with value in the current state.)  Further applying Jevons paradox means that we need public policy controls to keep the costs of data consumption high enough to reduce data consumption to the target (macroeconomic) level.

Note that I (PEG) haven’t included data storage efficiency and/or data storage cost (via cloud typically) improvements in the efficiency improvements.  I decided not to – though I could be convinced otherwise – because I felt that storage improvements don’t directly increase data consumption efficiency.  Perhaps we should include storage improvements which improve data indexing and searching – for example shorter seek times – but not those which increase storage capacity, decrease the cost per GB, and/or increase availability.
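
A toy rebound-effect model of that reading of Jevons paradox.  The elasticity framing and the numbers are assumptions, and per the preceding paragraph only consumption efficiency is modeled, not storage:

```python
def resource_use(efficiency, elasticity, baseline=1.0):
    """Total 'fuel' (processed data) consumed after an efficiency gain.
    Consumption responds to the lower effective cost with the given
    elasticity; resources used per unit of consumption also fall."""
    effective_cost = 1.0 / efficiency
    consumption = baseline * effective_cost ** (-elasticity)
    return consumption / efficiency

# Doubling efficiency: with elasticity > 1 (the Jevons case) total
# resource use rises despite the gain.
for eps in (0.5, 1.0, 1.5):
    print(eps, round(resource_use(efficiency=2.0, elasticity=eps), 3))
# -> 0.5 0.707 | 1.0 1.0 | 1.5 1.414
```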

JFB and PEG 2012 December 7

Google exec Eric Schmidt in a conversation at the Computer History Museum (and a broadcast which aired on KQED Radio on Dec 6) makes some interesting points about data generation and access.

With a flat data set indexing will only get you so far.

(? This date…?) Ex-Senator Chris Dodd was interviewed at the Commonwealth Club by Gavin Newsom on 2012 Oct 2.  The interview aired on KQED Radio 2012 Nov 30 at 8pm.  Dodd made some interesting points regarding over-communicating and the form of communication.  The KQED blurb for the talk was: “Commonwealth Club Chris Dodd: Creative Content and the Cloud The program’s guest is Chris Dodd, chairman and CEO of the Motion Picture Association of America and former U.S. senator from Connecticut. Dodd is charged with advocating for the film, home entertainment and television industries around the world. The MPAA represents one of the most creative, productive and powerful industries in America — one that supports 2.2 million creators and makers in every state and particularly in California. Dodd will discuss why technology and creative communities are essential to the economic well-being of their industries, consumers and the country.”.  The Commonwealth Club audio archive of the talk.

Dodd talked about paying attention to painstakingly handwritten letters vs. postcard mailers and the like.  He possibly implied that though today’s communication enables instant protests, it has less meaning because it’s too easily and thoughtlessly done.  (Witness all of the change.org and CREDO calls to action.  And the White House petition site.)

Twitter limits message length, but is the end-result more data with less value?  A shorter message length usually means people send more messages for less reason to more people; senders are broadcasting rather than tailoring a message for a specific recipient.

PEG 2012 December 13

LinkedIn today notified me that I was automatically unsubscribed from receiving weekly digests of the comments to one of the LinkedIn groups to which I belong because I haven’t recently visited the page for that group in LinkedIn.  To quote, “To help keep your email tidy, we’ve unsubscribed you from this group’s digests – we hope this suits you better!”.  Is LinkedIn altruistically helping me to reduce my data clutter or are they trying to drive more activity in their site in parallel with identifying active (from a LinkedIn perspective) members in LinkedIn groups?  At least they make it easy to reactivate the subscription.  Did my reactivation convert into another data point for LinkedIn?  In my case I do scan the weekly digests but obviously the digest content hasn’t motivated me to click-through to LinkedIn.  (In general digests occasionally motivate me to click-through in order to comment on a conversation.  Less frequently they motivate me to click-through in order to ‘Read more’; if an item catches my interest I’m more likely to research it outside of LinkedIn rather than click-through LinkedIn to a specific piece of third-party content.  I figure that’s fair if the scope of my research is much broader than the one item referenced by the click-through.)

If LinkedIn were truly interested in reducing my data clutter then they’d find a better way of providing access and structure to data than sending periodic emails of potentially random comments made by members of a particular group in that group’s LinkedIn page.  For example, how about a way to post a comment across multiple groups?  How about in a sidebar to a post to a LinkedIn group page LinkedIn automatically provide links to related posts within that LinkedIn group, in other LinkedIn groups, and to items external to LinkedIn (especially to those to which similar posts have most often referenced)?  How about a mechanism to automatically estimate the value to a group of a post, and then mechanisms for group members and non-group members to adjust that valuation (with the automatic, member, and non-member ratings appearing with the post)?  And then how about a feedback mechanism whereby LinkedIn modifies that particular group’s value rating criteria in order to more accurately evaluate new posts?  And finally, how about filtering digests based on the value ratings?  (Maybe also a filter based on the number (and value??) of references… which makes me think that references could and should also be value rated.)
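
A minimal sketch of the value-rated digest filter proposed above.  The weights, threshold, and 0-to-1 rating scale are assumptions; the feedback loop that retunes a group’s criteria would adjust these weights over time:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Post:
    title: str
    auto_score: float              # group's automatic estimate, 0..1
    member_ratings: List[float] = field(default_factory=list)
    nonmember_ratings: List[float] = field(default_factory=list)

def value(post, w_auto=0.5, w_member=0.35, w_nonmember=0.15):
    def avg(xs):
        return sum(xs) / len(xs) if xs else post.auto_score
    return (w_auto * post.auto_score
            + w_member * avg(post.member_ratings)
            + w_nonmember * avg(post.nonmember_ratings))

def digest(posts, threshold=0.6):
    """Filter the digest by value rating, as proposed above."""
    return [p.title for p in posts if value(p) >= threshold]

posts = [Post("Useful how-to", 0.8, [0.9, 0.7]),
         Post("Thinly veiled ad", 0.3, [0.2], [0.4])]
print(digest(posts))  # -> ['Useful how-to']
```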

PEG & JFB 2013 Jan 7

In the “Damien Jurado – Mixtape Master” segment of Snap Judgment episode #104 (2013 Jan 4), the narrator tapes over mix tapes provided by an acquaintance-of-a-friend, only realizing after the fact that the acquaintance was a pre-Nirvana Kurt Cobain and that the reverse sides were Cobain’s early demo tapes.  This story offers a reminder that most folks – especially cash-strapped teens – as recently as 20 years ago had a finite amount of audio and video data storage capacity: cassette and VHS tapes were reused, cleaned, and even repaired until replacement became necessary.  Today we electronically record (copy) material ad infinitum for later listening and/or viewing – no hard choices about what to overwrite.  On the plus side we don’t lose material through inadvertent or intentional overwriting; on the minus side we lose material in the resulting data clutter – we never have to assign a relative value to items in our collections and we’re constantly adding to rather than reusing our storage media.  JFB provided the example of, while away on vacation, ‘taping’ holiday programs for later viewing with a high probability of overlap with the as-yet unwatched programs from last year’s taping.

JFB recently procured a duplicate file locator as a means of reducing data clutter.
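
A duplicate file locator along those lines can be sketched in a few lines – group files by size, then by content hash.  (The behavior of JFB’s actual tool is unknown; this is just the general technique.)

```python
import hashlib, os, sys
from collections import defaultdict

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def find_duplicates(root):
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:            # hash only plausible duplicates
            for path in paths:
                by_hash[sha256(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]

for group in find_duplicates(sys.argv[1] if len(sys.argv) > 1 else "."):
    print(group)
```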

PEG 2013 Jan 18

From Forget YOLO: Why ‘Big Data’ Should Be The Word Of The Year by Geoff Nunberg on NPR on 2012 December 20: “What’s new is the way data is generated and processed. It’s like dust in that regard, too. We kick up clouds of it wherever we go. Cellphones and cable boxes; Google and Amazon, Facebook and Twitter; cable boxes and the cameras at stoplights; the bar codes on milk cartons; and the RFID chip that whips you through the toll plaza — each of them captures a sliver of what we’re doing, and nowadays they’re all calling home.”.

PEG 2013 Jan 19

From the CS Monitor 2013 January 14 “Six Picks” column: “Gretchen Rubin, author of “The Happiness Project,” turns her attention to being more content in her abode in her book “Happier at Home.” Ms. Rubin focuses on realizing how much she has to be grateful for in her house already, as well as making it easier to feel satisfied there. One step: Get rid of clutter. “In many cases, my possessions blocked my view and weighed me down,” she wrote. “I wanted to feel more in control of stuff.” Check out her book if you’re craving some ideas to help you make a fresh start in your own home.”

PEG 2013 Feb 26

Let Go of Clutter by Harriet Schechter and the companion book site and The Miracle Worker Organizing site.  See ‘Data Clutter’ child page.

PEG 2013 Apr 05

In the 2013 April 4 Marketplace clip titled “Big Data creates big industry for storing data” Stacey Vanek Smith references a 2012 Aberdeen Group study on data storage costs and visits the rapidly expanding Switch data centers outside of Las Vegas. She quotes Dick Csaplar of the Aberdeen Group.  [Interestingly, Aberdeen Group is part of Harte-Hanks, the big advertising and market intelligence (i.e. data) firm.]  Key Csaplar quote: data storage currently accounts for [at least] 12% of IT budgets and the need for data storage is doubling every two years.

“Tape: The Ultimate Storage Tier”, May 2012

“How Much of Your Data Should be in the Public Cloud?”, November 2011

Related:

A 2013 March 25 Computerworld article titled “Storage administrators demand simplicity” by Kevin Fogarty includes quotes on data storage by Dick Csaplar (Aberdeen Group) and Ashish Nadkarni (IDC).

White paper: “EMC Deduplication for Backup, Recovery, and Long-term Information Retention”

Locked Aberdeen Group reports:

http://www.aberdeen.com/Aberdeen-Library/8106/SI-cloud-storage-adoption.aspx

http://www.aberdeen.com/Aberdeen-Library/8109/AI-cloud-storage-gateways.aspx

http://www.aberdeen.com/Aberdeen-Library/8145/RB-master-data-management.aspx


PEG 2013 Apr 20

Radiolab 2008 November 17 episode titled Choice.

This American Life 2013 April 19 episode titled Picture Show covers both the Boston Marathon bombing image search for suspects on Reddit and Israeli soldiers mapping neighbors – collecting data as a means to an end rather than for the sake of the data (which gets thrown away).

Daniel Kahneman: Thinking, Fast and Slow

Edison and the light bulb – http://www.kqed.org/a/radiospecials/R201304120200

Reference

A May 2013 McKinsey Forum article by Dorian Stone titled “Customer journey analytics and Big Data” makes the distinction between reporting and analytics: reporting simply presents data while analytics processes the data into input that improves decision timeliness and outcomes.

6 Replies to “Data Lifecycle”

  1. Reflex to accumulate material things as a parallel to acquiring and storing data – but with data the costs of storing are less visible. Data storage, up to a point, looks like it’s free. It’s free to archive email, so we do. What if you archived all your physical mail in a ‘to read’ basket? Maybe our relationship to digital hasn’t grown up yet?

  2. What are the costs of storing data? Depending on your application and the size of your mailbox, your mail app can run really slowly. It makes backing up the mailbox a hassle. Searches are a lot slower – even if indexed.
    The cost of deleting emails that you decide you don’t want to read.
    A rule to delete or archive all but the n most recent emails from a particular source and/or on a particular subject (see the sketch after these replies).
    A way to combine email threads.
    Why don’t we use the rules available? The effort required to set up deletion.

  3. Costs of finding information when there’s more stuff to sort through – and the veracity and/or authoritative source is questionable.
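
A minimal sketch of the keep-only-the-n-most-recent rule from reply 2.  The message representation is an assumption for illustration, not a real mail-client API:

```python
from collections import defaultdict

def to_archive(messages, n=3, key=lambda m: m["sender"]):
    """Return the messages to archive (or delete): everything but the
    n most recent per sender (or per subject, via the key argument)."""
    groups = defaultdict(list)
    for m in messages:
        groups[key(m)].append(m)
    result = []
    for group in groups.values():
        group.sort(key=lambda m: m["date"], reverse=True)
        result.extend(group[n:])      # everything but the n newest
    return result

msgs = [{"sender": "news@example.com", "date": "2013-01-0%d" % i}
        for i in range(1, 6)]
print(len(to_archive(msgs)))  # -> 2 (the two oldest of five)
```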
