Big Data

Have you ever felt unnerved by the way those Target register coupons seem to know you just a little too well? Wondered whether your doctor’s electronic medical records really keep your health information safer? Pondered how, exactly, your new smartphone can be just so…smart?

November 11, 2015



Welcome to the world of big data, where computers, and the engineers who build them, see you as both “just a number” and anything but.

“Trying to define big data is a bit like trying to define freedom,” said Dr. Greg Speegle, professor and chair of Baylor’s computer science department. “It means different things to different people; anything from cloud computing to analyzing huge sets of information being generated online, minute-to-minute.”

Just how huge? Consider this: the amount of quantifiable, documented information produced by humans from the beginning of recorded time until 2003 is now generated anew every 48 hours. Walmart alone requires mainframes capable of hosting more than 150 times the information found in the entire Library of Congress just to track its more than one million customer transactions per hour.

Some big data, like retail purchase tracking, is by nature definitive. Some, like the cheminformatics used to test new pharmaceuticals virtually, is predictive. Much more is unstructured, and that share keeps growing. Think social media, whose tweets and tags, rants and raves have birthed a new kind of information that’s highly specific and just as valuable, but increasingly difficult to harness and, in some cases, far less reliable.

“Big data, as we’ve come to know it, creates many opportunities but also many challenges,” said Dr. David Lin, associate professor of computer science. “As a scientist, I use those challenges to help people and to make their lives better, more convenient.”

Just a decade ago, the kind of big data analysis we know today was rare, typically available only to corporations and other large entities that could afford the services of highly specialized software firms. Today, Lin said, it is accessible to businesses of every size and kind. Lin himself works for several such companies, turning an unfathomable amount of web-based data into useful information that accurately answers everything from the trivial (Where’s the nearest Whataburger?) to the pivotal (Which candidate’s views best mirror my own?).

Information, he said, doesn’t necessarily mean knowledge. And while the sheer amount available is a powerful tool, it can also be misleading.

“Everybody wants their information fast and accurate,” Lin said. “But nobody has time to manually process the sheer amount of data at our fingertips to figure out what is useless and what is useful.”

Through complex computer algorithms, Lin helps the internet make sense of non-numeric data, such as images and social media posts; accurately answer skyline queries (results that change based on the user’s location, preferences, or other variables); and even perform automated fact-checking and spam detection.
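For readers curious about the mechanics: in the database literature, a skyline query is usually described as returning only the options that no other option beats on every criterion at once. The snippet below is a minimal sketch of that idea, not Lin’s own method. The restaurant names, distances and prices are invented for illustration, and it assumes lower is better on both measures.

```python
from typing import List, Tuple

# Hypothetical record: (name, distance_km, price_usd). Lower is better for both.
Point = Tuple[str, float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on every criterion and strictly better on one."""
    no_worse = a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates (simple O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented sample data, for illustration only.
restaurants = [
    ("A", 1.0, 12.0),
    ("B", 2.0, 8.0),
    ("C", 3.0, 9.0),   # farther and pricier than B
    ("D", 0.5, 15.0),
    ("E", 2.5, 8.0),   # same price as B but farther
]

for name, distance, price in skyline(restaurants):
    print(f"{name}: {distance} km, ${price:.2f}")
```

Here the closest, the cheapest and the in-between option survive, while anything that is both farther and no cheaper than a rival drops out. A production system would use a faster algorithm than this pairwise scan.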

His newest project, an as-yet unpublicized partnership with a company in the medical field, aims to refine the electronic medical records process, one that is highly prone to human error.

“Let’s say I need surgery,” he posited. “What happens when the surgeon goes to review my medical history, and someone at the hospital mistyped my social security number by one digit? Or if I’m listed as ‘David Lin’ at my doctor’s office and ‘D. Lin’ at the pharmacy? These are not ‘pie in the sky’ problems. We’re using academic techniques to solve real-life issues.”
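The two slips Lin describes, a single mistyped digit and a first name shortened to an initial, are the kind of thing approximate record matching is meant to catch. What follows is a rough sketch under stated assumptions, not the technique used in his project. The function names, the one-digit tolerance and the sample records (with deliberately invalid identification numbers) are all hypothetical.

```python
def digits_differing(a: str, b: str) -> int:
    """Count differing digit positions, ignoring dashes and spaces."""
    da = [c for c in a if c.isdigit()]
    db = [c for c in b if c.isdigit()]
    if len(da) != len(db):
        # Different lengths: treat as a complete mismatch.
        return max(len(da), len(db))
    return sum(x != y for x, y in zip(da, db))


def split_name(name: str):
    """Return (first, last) in lowercase; assumes a non-empty name."""
    parts = name.replace(".", " ").lower().split()
    return parts[0], parts[-1]


def names_compatible(a: str, b: str) -> bool:
    """Same surname, and first names are equal or one is the other's initial."""
    first_a, last_a = split_name(a)
    first_b, last_b = split_name(b)
    if last_a != last_b:
        return False
    if first_a == first_b:
        return True
    short, long_ = sorted((first_a, first_b), key=len)
    return len(short) == 1 and long_.startswith(short)


def likely_same_patient(rec_a: dict, rec_b: dict, max_typos: int = 1) -> bool:
    """Flag two records as a probable match (hypothetical rule of thumb)."""
    return (names_compatible(rec_a["name"], rec_b["name"])
            and digits_differing(rec_a["ssn"], rec_b["ssn"]) <= max_typos)


# Invented records; the numbers are not valid Social Security numbers.
doctor = {"name": "David Lin", "ssn": "000-12-3456"}
pharmacy = {"name": "D. Lin", "ssn": "000-12-3457"}
stranger = {"name": "Dana Lin", "ssn": "000-98-7654"}

print(likely_same_patient(doctor, pharmacy))    # probable match
print(likely_same_patient(pharmacy, stranger))  # initial fits, number does not
```

A rule this loose would be far too risky on its own in a hospital, where a false match can be as dangerous as a missed one. Real systems weigh many more fields before linking two records.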

While his work assists many private-sector clients, Lin believes in creating big data solutions that can be shared publicly and free of charge across industries, technology platforms and user interfaces. As for that privacy question, he believes big data walks a fine line.

“A certain faction believes privacy is a commodity that will be outdated in 10 years, if not sooner,” he said. “Even today, of course, there’s a trade-off. You give Kroger your address so you can get a discount. The incentive is usually enough for people to give away the data. So, is that an invasion of privacy when it’s granted? For me, the question is always, ‘How can we do the best analysis possible with the least invasion of privacy?’”

A focus on scientifically accurate and ethically acceptable big data analysis is, for Lin, a primary reason he chose Baylor, and one he tries to emphasize in both his research and his teaching. The opportunity to help “very bright” students grow in a collaborative environment, coupled with the university’s industry-leading systems and software, keeps him fully equipped in a field that is growing exponentially.

“To organize, make sense of and ultimately use this tremendous amount of information, we have to have the algorithms,” Speegle said. “Dr. Lin is doing work that allows us to find new and fascinating revelations in data that’s always been there, but never interpreted with clarity and accuracy.”