Sunday, March 22, 2015

Uncertain Health in an Insecure World – 29

“Garbage in – Garbage Out”

Ninety percent (90%) of the world’s data has been generated in the last 2 years.

A physician assistant just completed my life insurance physical exam and lab work. In 20 minutes or so, he recorded a medical history, measured vital signs – height, weight, pulse, blood pressure – then collected my blood and urine for biochemistry and other assays. Just imagine how many times per second this type of medical data and personal health information (PHI) is obtained in doctors’ offices, ambulances & hospitals, and entered into databases around the world.

Global healthcare data is growing exponentially.

In fact, the average hospital generates 665 terabytes of medical and PHI data per year! The California-based Kaiser Permanente healthcare system has accumulated 40-50 petabytes of insurance and treatment data from its ~9 million members!! The entire U.S. healthcare system will soon exceed a zettabyte (10²¹ bytes) of data!!!
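For a rough sense of these magnitudes, a quick back-of-envelope calculation (decimal storage units assumed; the 665 terabyte and 40-50 petabyte figures are the ones quoted above):

```python
# Back-of-envelope scale check for the storage figures above (decimal units).
TB = 10**12   # bytes per terabyte
PB = 10**15   # bytes per petabyte
ZB = 10**21   # bytes per zettabyte

hospital_per_year = 665 * TB      # one average hospital's annual data output
kaiser_archive = 45 * PB          # midpoint of the 40-50 PB estimate

# How many average-hospital-years of data make up one zettabyte?
print(round(ZB / hospital_per_year))               # ~1.5 million hospital-years

# Kaiser's archive expressed in average-hospital-years
print(round(kaiser_archive / hospital_per_year))   # ~68 hospital-years
```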

Significant advances in desktop gene (DNA) sequencing and polymerase chain reaction (PCR) testing procedures provide doctors with rapid diagnoses of infectious microbes, drug resistance and rare diseases. These new technologies are also generating massive amounts of data. A key business strategy of some Big Pharma and global medical lab services conglomerates is to use such companion diagnostics data to bring greater treatment efficiency & effectiveness to the healthcare marketplace – so-called precision or personalized medicine.

This is the promise of “Big Data”!

International Business Machines (IBM) has defined three big data characteristics: volume (i.e., the sheer scale of stored data), velocity (i.e., the speed at which data arrives and must be processed) and variety (i.e., the mix of structured, semi-structured and unstructured data). McKinsey estimates that up to 80% of the information being collected in the U.S. healthcare system is unstructured – medical device recordings, doctors’ notes, monitor & sensor readouts, lab results, imaging studies, clinical outcomes and financial claims data.

In healthcare, there is a fourth big data ‘V’ called veracity.

This blog previously noted the insecurity of PHI data (see post #21). Healthcare data quality assurance is also critical. Like financial data, healthcare data must be error-free and credible; an illegible handwritten prescription is the classic failure. Poor healthcare data quality in a data warehouse can have life & death consequences, especially when unstructured data is involved.
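As an illustration of such a veracity check, here is a minimal sketch that gates records before they enter a warehouse – the field names and plausible ranges are hypothetical, not from any real system:

```python
# Minimal sketch of a veracity gate: reject records with implausible vitals
# before they reach the warehouse. Field names and ranges are illustrative.

PLAUSIBLE_RANGES = {
    "pulse_bpm": (20, 250),
    "systolic_mmhg": (50, 300),
    "weight_kg": (1, 500),
}

def validate(record):
    """Return a list of veracity problems found in one patient record."""
    problems = []
    for field, (lo, hi) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"missing {field}")
        elif not (lo <= value <= hi):
            problems.append(f"{field}={value} outside plausible range")
    return problems

clean = {"pulse_bpm": 72, "systolic_mmhg": 120, "weight_kg": 80}
dirty = {"pulse_bpm": 720, "weight_kg": 80}   # typo'd pulse, missing BP

print(validate(clean))   # []
print(validate(dirty))   # two problems flagged
```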

Analytics is the key value proposition for healthcare big data.

Scaling up such unstructured healthcare data requires a different analytics architecture than conventional business intelligence tools provide. Industrial-strength big data computing demands distributed processing across many servers (or nodes), utilizing parallel “divide & process” open architecture computing that has only recently become available.
Rapidly increasing data requires adequate alternate storage capacity. Check!
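The “divide & process” pattern can be sketched in miniature, with local worker processes standing in for the servers of a real cluster (the readings and threshold are purely illustrative):

```python
# A toy "divide & process" pattern: split a large dataset into chunks,
# process the chunks in parallel workers, then merge the partial results.
# Real systems distribute this across server nodes; local processes stand in.

from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    """Stand-in for per-node work: count readings above a threshold."""
    return sum(1 for reading in chunk if reading > 140)

def divide(data, n_chunks):
    """Split the data into roughly equal chunks, one per worker."""
    size = max(1, len(data) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    readings = [120, 150, 145, 90, 180, 135, 160, 110]   # e.g., systolic BPs
    chunks = divide(readings, 4)
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(process_chunk, chunks))
    print(sum(partials))   # total readings above threshold -> 4
```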

Data centers can now store data on large servers (100 terabyte capacity) for later processing using code designed for the relevant application. And while 2015 computer hard disk storage (terabytes and even petabytes), random access memory (RAM >16 gigabytes) and read speeds (>100 megabytes per second) have increased roughly 1000-fold since the 1990s, it was the advent of open architecture computing that made the promise of big data analytics possible.

The vision of governments and businesses operating in the healthcare sector is to combine big data, advanced computing science and analytics to solve the complexity of chronic disease management, improve clinical trial accuracy and enable personalized treatments in daily medical practice. Big data only becomes useful through predictive & prescriptive analytics (see post #18).

Before the advent of open architecture computing, data was stored on disks and computation was processor-bound. Conventional relational databases were accessed using structured query language (SQL) supported on one server. Analysis programs, typically written in Java, ran wherever the processor was located; data had to travel to the program, rather than the program being sent to the data. Processing speed declined when large data sets were presented to the server at the same time.
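That single-server pattern is easy to picture with SQLite standing in for the one-node relational database (the table and lab values here are invented for illustration):

```python
# The pre-Hadoop pattern: one relational database on one server, queried
# with SQL. sqlite3 stands in for the single-node RDBMS here.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE labs (patient_id TEXT, test TEXT, value REAL)")
conn.executemany(
    "INSERT INTO labs VALUES (?, ?, ?)",
    [("p1", "glucose", 5.4), ("p2", "glucose", 7.9), ("p1", "hba1c", 6.1)],
)

# All computation happens where this one server's processor is.
row = conn.execute(
    "SELECT AVG(value) FROM labs WHERE test = 'glucose'"
).fetchone()
print(row[0])   # mean glucose across patients
conn.close()
```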

Huge data sets generated by the web search engines Google and Yahoo led Doug Cutting to create, and the Apache Software Foundation to release, Hadoop and its Hadoop Distributed File System (HDFS) for storing big data, building on Google’s 2003 distributed file system paper. MapReduce software, described by Google in 2004, distributes sub-tasks to 100’s or 1,000’s of servers in a Hadoop cluster, maps initial outputs, “shuffles” them between stages, then reduces & tracks these map outputs in parallel processing jobs.

Open architecture "write once, read many" computing ecosystems like Hadoop improve processing time by spreading a large amount of stored data across multiple servers, reducing the time from query to output. Hadoop achieves high processing speed by putting those servers to work in parallel, scaling processing power to match the huge amount of data generated.
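The map → shuffle → reduce flow described above can be imitated in a few lines of single-process Python – counting toy diagnosis codes rather than running on a real cluster:

```python
# A minimal single-process imitation of the map -> shuffle -> reduce flow,
# counting diagnosis codes. Real Hadoop runs each stage across many servers;
# these three functions mirror the stages only.

from collections import defaultdict

def map_phase(records):
    """Emit (key, 1) pairs, one per diagnosis code seen."""
    return [(code, 1) for record in records for code in record.split()]

def shuffle_phase(pairs):
    """Group all values by key, as the shuffle does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum each key's values into a final count."""
    return {key: sum(values) for key, values in groups.items()}

records = ["I10 E11", "E11", "I10 I10"]    # toy diagnosis-code records
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)   # {'I10': 3, 'E11': 2}
```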

The healthcare sector has been strangely slow to join this big data revolution. Why?

Patient confidentiality concerns have caused adoption to lag behind the retail and banking industries. Big data capture by healthcare systems, governments and Big Pharma has paralleled open architecture advances in computing science. Ever-increasing healthcare costs raise serious sustainability concerns that can now be confronted by aggregating insurance risks (“bundling”), managing utilization (“right care”) and tracking patient outcomes in linked databases (e.g., Kaiser Permanente’s HealthConnect).

Working with such big data at scale may eventually allow healthcare stakeholders to create value by exchanging efficacy information and incentivizing greater efficiency.

Until this happens, it’s healthcare big data “garbage in – garbage out” littering the Square.  
