The ubiquity of and demand for data have increased the need for better data tools, and as the tools get better and better, they lower the barrier to entry for data work. In turn, as more people enjoy that ease of use, data literacy becomes the norm.
Ginette: “I’m Ginette.”
Curtis: “And I’m Curtis.”
Ginette: “And you are listening to Data Crunch.”
Curtis: “A podcast about how data and prediction shape our world.”
Ginette: “A Vault Analytics production.”
“We have a gift for you this holiday season. We’re giving you, our listeners, a website . . . it’s a website of all the AI applications we come across or hear about in our daily research. We post bite-size snippets about the interesting applications we are finding that we can’t feature on the podcast, so that you can stay informed and see how AI is changing the world right now. There are so many interesting ways that AI is being used to change the way people are doing things. For example, did you know that there is an AI application for translating chicken chatter? Or one using drones to detect and prevent shark attacks in coastal waters? To experience your holiday gift, go to datacrunchpodcast.com/ai.”
Curtis: “If you’ve listened to our History of Data Science series, you know about the amazing advances in technology behind the leaps we’ve seen in data science over the past several years, and how AI and machine learning are changing the way people work and live.
“But there is another trend that’s also been happening that isn’t talked about as much, and it’s playing an increasingly important role in the story of how data science is changing the world.
“To introduce the topic, we talked with someone who is part of this trend, Nick Goodhartz.”
Nick Goodhartz: “So I went to school at Baylor University, and I studied finance and entrepreneurship with a minor in music. I ended up taking a job with a start-up, essentially as a data analyst. It was an ad technology company that was a broker between websites and advertisers, and so I analyzed all the transactions between those and tried to find out what we were missing.
“We were building out these reports in Excel, but there was a breaking point when we had this report that we all worked off of, and it got too big to even email to each other. It was this massive monolith of an Excel report, and we figured there’s got to be a better way. Someone else on our team had heard of Tableau, so we got a trial of it. In 14 days—actually less than 14 days—we were able to get our data into Tableau, take a look at some things we were curious about, and pinpoint a possible customer who had popped their head out and then disappeared. We approached them and signed a half-million-dollar deal, and that paid for Tableau a hundred times over, so it was one of those moments where you really realize, ‘man, there’s something to this.’
“That’s what got me into Tableau and what changed my mind about data analysis, because at school, analyzing finance was nothing but Excel and mindless tables of stock capitalization and all this stuff. What made it fascinating was finding a way to look at it and answer questions on the fly, and then it actually changed the way I look at things around me. I find myself now watching a television show and thinking, ‘Well, this episode wasn’t as interesting. I wonder what the trends of the ratings look like.’ It really has changed the way I think about data because of how easy it’s been to access it.”
Ginette: “Nick is a member of a growing portion of people who didn’t think they’d end up doing analytics. He didn’t have the specific training for it, he doesn’t have a computer science or statistics degree, and he doesn’t spend nights and weekends writing code. And yet, he was able to produce extremely useful insights from his company’s data stores and help land a large business deal. Not only that, he found the process of finding insights from data so fascinating that it spilled over into his leisure time.”
Nick: “So I’ve been in my personal time focusing on areas of data that I find fascinating. One is NBA basketball data. I’m a huge sports fan, and I’ve been taking a look at a lot of the advanced statistics for NBA basketball and trying to compare players’ productivity to how much they’re paid.”
Ginette: “Essentially he wanted to find the ROI on a player.”
Nick: “So I took what’s called their win shares, which is essentially how many of a season’s wins can be attributed to one player, and I divided that by how much they were paid, and so I essentially came up with, per $100,000 they’re paid, how many wins a player can basically be expected to contribute.
“So the Detroit Pistons for the 2014–2015 season had a player named Andre Drummond, who in that season was paid two and a half million dollars. His win shares per $100,000 were just under half a win, so basically for every $200,000 he was paid, he would contribute one win. That season the league average was .22, so I compared those two and said, ‘Well, he’s higher than the average, so he was underpaid.’ And then I looked further and asked, what is this statistic, how has it trended over his career, and has it increased or decreased along with his salary?
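Nick’s metric can be sketched in a few lines of code. This is a minimal illustration, not his actual workbook: the win-share figure below is an assumption inferred from the quoted ratio (just under 0.5 wins per $100,000 on a $2.5 million salary), and it divides by total season salary, which is what makes the quoted numbers work out.

```python
def wins_per_100k(win_shares: float, salary: float) -> float:
    """Win shares contributed per $100,000 of salary."""
    return win_shares / (salary / 100_000)

# Andre Drummond, 2014-15: ~11.5 win shares (assumed) on a $2.5M salary
drummond = wins_per_100k(win_shares=11.5, salary=2_500_000)
league_avg = 0.22  # league average quoted in the episode

print(f"Drummond: {drummond:.2f} wins per $100K")      # just under half a win
print(f"Above league average? {drummond > league_avg}")  # True -> underpaid
```

At 0.46 wins per $100,000, one win costs roughly $217,000, which matches Nick’s rounded “one win for every $200,000” figure.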
“A lot of intangible things go into a player’s salary. Someone like LeBron James isn’t paid just because he’s really good at basketball (he is) but also because he fills seats like nobody else does, and ticket sales really drive salary too. Granted, he still fills seats because he’s really good, but he’s also just very popular, and that gets inflated and things like that, so this doesn’t necessarily measure that, but I’m not too worried about that. I want to look more at performance.”
Curtis: “Tools that allow easy access to working with data are on the rise, and they are allowing people like Nick to take advantage of the data-filled world we live in, not just in their careers, but also in their personal interests. You don’t have to be a technical wizard to take part in the data economy. In fact, Tableau has one of the most active and vibrant user groups of any piece of software because of how it changes how people think and allows them to take advantage of data. We talked to one of the most active people in this community, Adam McCann, who has achieved the honorary title of ‘Tableau Zen Master,’ recognizing him as a kind of super contributor in the community. Only about 20 people per year are given this title.”
Adam McCann: “I’ve been using Tableau for about seven years. It makes working with data exciting and interesting. And so then I started my blog about a year or two after that, mainly because I had switched jobs and went to a place where I was doing more managing and mentoring, and I didn’t have the opportunity to actually do as much building in my day-to-day. I wanted to keep using Tableau, keep learning, and the best way I could think to do that was to start up my blog, which is Dueling Data.
“And that’s primarily the way I became a Zen Master. . . . So the major way I became a Zen Master was through my blog, posting how-tos or innovative ways of using the software, interesting data visualizations, and telling cool stories, and that’s the main way I think most people end up going the Zen Master route.”
Ginette: “In addition to his popular Tableau how-tos, Adam also posts really interesting analyses on his Dueling Data site.”
Adam: “The thing I like about Tableau is you can do anything with it. So in my free time, I tend to focus on things that interest me on a personal level, so a lot of my projects have to do with music and television, and a lot of the examples you see on my website are based on music: analyzing lyrics, lyric sentiment, and trends in music popularity.
“An interesting project I worked on more recently was analyzing presidential bumper stickers, looking at the relative font sizes of different presidential candidates over the last, I think, 12 elections, and asking what those font sizes tell you about the likelihood that a candidate will win. I found that in every presidential campaign where the font size of the presidential candidate’s name was larger than the VP candidate’s, that candidate won, and in the majority where the reverse was true—where the presidential candidate’s font was smaller than or equal to the VP’s—the candidate lost. In fact, Hillary Clinton had the most significant disparity between her and Tim Kaine in terms of font size. Hillary’s was actually smaller when you scale it for the number of letters in their names—actually smaller than Tim Kaine’s on their bumper stickers. Whereas Trump’s was significantly larger than his VP candidate Mike Pence’s; I think it was one of the greatest disparities.
“One other thing was every single Republican . . . the largest fonts were Republicans’ and the smallest fonts tended to be Democrats’. I don’t know if that’s interesting or what that might tell you, but I thought that was somewhat interesting. And you know, Barack Obama actually was the one that had the smallest font relative to the overall size of the bumper sticker. The font of his name was actually really small relative to the size of the bumper sticker, whereas on Bush-Cheney, Bush’s name is nearly 70 percent of the overall bumper sticker.”
Curtis: “As Nick and Adam attest, Tableau is an amazing tool that has made visualizing and gaining insights from data so much easier for so many people. But it doesn’t stop with Tableau; there are many other tools joining the wave to bring the power of data science and AI to the common user. While Tableau helps solve the problem of helping people explore and visualize their data, a newer entrant to the market, a company called Trifacta, tackles the extremely difficult problem of making data preparation easier, which is currently one of the biggest time sinks and least desirable portions of a data scientist’s job. It’s laborious, it’s tedious, and it’s painfully slow. We spoke with Connor Carreras from Trifacta, who works with their customers to help them understand and start using the software, and she told us about an important project with the Centers for Disease Control and Prevention, which was an interesting application of Trifacta’s software.”
Connor: “So the group we’re working with at the CDC is a team of epidemiologists, and they are specifically interested in understanding the causes and the spread of HIV outbreaks. Part of the reason they want to understand this is so that they can figure out which risk factors are contributing to the spread of HIV, and which populations are most vulnerable to it, so that they can implement different public health programs to either educate or help prevent the spread of that disease. This was the first run-through of this project, because this team has been essentially redoing this work for different outbreaks in different regions.
“But the first outbreak that they tackled was one out of Indiana, I believe, so it was a really small county. There were about 24,000 residents, and it’s a rural county that had a history of opioid addiction, so lots of hepatitis C infections in the past, but interestingly there had been a very small, practically nonexistent number of HIV cases in that region. Then suddenly in 2014 there was a huge spike in HIV cases in this county, and so the CDC looked at this and said, that’s really odd. Why is this happening? Is there some sort of trend we can identify that is causing these cases, or a way we can start to limit the spread of disease in these areas? So what the team did is they pulled in a massive amount of data related to this 2014 outbreak. I recall working with the data scientists out there to parse out the actual genetic codes of the different cases of HIV, and they were doing some mapping, some sort of phylogenetic clustering, to figure out links between the patients in the outbreak. A lot of my knowledge of the CDC comes from Hollywood disaster movies about disease—interestingly, they’re not all that far off in terms of the science. They were looking for patient zero, and then the different hops from that patient to the other patients in the outbreak.
“And so the role that Trifacta played in this is, first of all, there’s just a lot of data. So much data that attempting to process it manually or write some sort of command-line program to handle the cleansing and combining of that data would just be really time-consuming. This is a use case that really benefited from having a tool like Trifacta, where you cut those batch inefficiencies out of the process entirely. Instead of writing your program, pumping hundreds of gigabytes of outbreak data through your initial go at a cleansing recipe, waiting potentially hours for it to finish, looking at the results, realizing it’s still not right, going back, and starting again, they were able to really reduce all of those iterations: look at the data in Trifacta, design their data recipe to cleanse the data, get that immediate preview of how the cleansed data will look, and at that point see, ‘Oh, we’re still dealing with some problems in this column, so let’s tweak our logic or add a new step to address that problem.’ And the turnaround time here was pretty crucial for the CDC, because they really wanted to get to the answers quickly so they would be able to set up programs more efficiently to halt the spread of this disease.”
Ginette: “While these tools continue to develop, more and more people will be able to participate in and benefit from processing, analyzing, and learning from data. There are also many people working on useful technology tools to bring machine learning and artificial intelligence within the reach of non-technical users, and the first people to come up with a great solution for that will likely generate another windfall of efficiency for data science and machine learning.”
Curtis: “If you’re interested in getting more involved in data but don’t feel like you have the technical skills, check out Tableau and Trifacta. Both offer free downloads, and you can get up and running quickly. And incidentally, if you’d like to level up your skills even more, Ginette and I teach a class on Trifacta you can find on Udemy.com—just search for Trifacta and you’ll find it. We’re the only ones currently teaching it on the site.”
Ginette: “A special thanks to Connor Carreras and Paige Schaefer from the Trifacta team and Nick Goodhartz and Adam McCann for speaking with us. As always, go to datacrunchpodcast.com to find our source material and attributions. That’s also where you can go to give us feedback.”