When Julia Silge’s personal interests meet her professional proficiencies, she discovers new meaning in Jane Austen’s literature, and she gauges the cultural influence of locations in pop songs. Even more impressive than these finds, though, is that she and her collaborator, Dave Robinson, have developed some new, efficient ways to mine text data. Check out the book they’ve written called Tidy Text Mining with R.
Below is a partial transcript. For the full interview, listen to the podcast episode, which is also available through Apple Podcasts, Google Play, Stitcher, and Overcast.
Julia Silge: “One that I worked on that was really fun was about song lyrics. The last 50 years or so of pop songs, we have all these lyrics, so all this text data, and I wanted to ask the question, what places are mentioned more or less often in these pop songs.”
Ginette: “I’m Ginette.”
Curtis: “And I’m Curtis.”
Ginette: “And you are listening to Data Crunch.”
Curtis: “A podcast about how data and prediction shape our world.”
Ginette: “A Vault Analytics production.”
Curtis: “Brought to you by data.world, the social network for data people. Discover and share cool data, connect with interesting people, and work together to solve problems faster at data.world. Whether you’re already a frequent dataset contributor or totally new to data.world, there are several resources you can use to stay in the loop on the latest features, learn new skills, and get support. Check out docs.data.world for up-to-date API documentation, tutorials on SQL and other query techniques, and much more!”
Ginette: “We hope you’re enjoying some vacation time this summer. We just did, and now Data Crunch is back! To hear the latest from us, add us on Twitter, @datacrunchpod. Today we hear from an exciting guest—someone who is on the cutting edge of data science tool creation, someone exploring and developing new ways to slice and dice difficult data.”
Julia: “My name is Julia Silge, and I’m a data scientist at Stack Overflow. My academic background is in physics and astronomy. I’ve worked in academia, teaching and doing research, I worked at an ed tech startup, and I’ve made a transition now into data science.”
Ginette: “Stack Overflow, where Julia works, is the largest online community for programmers to learn, share knowledge, and build their careers. It’s a great resource when you need to solve a coding problem or develop new skills.”
Curtis: “Now there are basically two main camps in data science: people who program with R, a statistical programming language, and people who program with Python, a high-level, general purpose language. Both languages have devoted followers, and both do excellent work. Today, we’re looking at R, and Julia is a big name in this space, as is her collaborator Dave Robinson.”
Julia: “Text is increasingly a really important part of our work as people who are involved in data. Text is being generated all the time, at ever faster rates. This unstructured data is becoming a really important part of things that we do. My academic background is not in text or literature or natural language processing or anything like that, but I am somebody who’s always been a reader and always been interested in language, and this collection of circumstances all came together, so Dave and I decided to develop some tools for making text mining something that people can do within the idiom of the R programming language. So we’ve developed a package called tidytext.”
Ginette: “Now this particular tool is based on tidy data principles, which is basically organizing data in a uniform way so it’s ready for you to ferret out insights.”
Julia: “There’s a section of people who use tools that are built for dealing with tidy data principles, which means you say, I’m going to take my data and I’m going to set it up so that it has a consistent form, so that I can use a consistent set of tools. And I love working with data this way because it makes so many things, from initial exploration to modeling to making plots, so much easier. It makes so much of my workflow as a data scientist joyous and delightful. You’re like, ‘Oh look! I can do this, I can do this, I can do this.’ Instead of fighting, and like, ‘Oh, no! I have to fight again.’”
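As a rough illustration of the consistent form Julia describes, here is a minimal sketch of text reshaped so that every row holds one token with the same fields. It is only an analogue written in plain Python; the tidytext package itself is for R. The sample lyrics, the `tidy_tokens` helper, and the field names are all invented for this example.

```python
import re
from collections import Counter

# Hypothetical sample data: (song title, lyric text) pairs invented for illustration.
songs = [
    ("Song A", "California dreaming on a winter day"),
    ("Song B", "Going back to California, back to the coast"),
]

def tidy_tokens(songs):
    """Return one row per token, each a dict with 'song', 'position', and 'word'."""
    rows = []
    for title, text in songs:
        # Lowercase the text and keep runs of letters and apostrophes as words.
        words = re.findall(r"[a-z']+", text.lower())
        for position, word in enumerate(words, start=1):
            rows.append({"song": title, "position": position, "word": word})
    return rows

rows = tidy_tokens(songs)

# Every row has the same shape, so one generic tool (here a counter)
# works without any special handling per song.
word_counts = Counter(row["word"] for row in rows)
```

Once the text is in this shape, counting, filtering, and grouping all operate on the same kind of row, which is the convenience the tidy approach is after.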
Curtis: “This R add-on was so successful that Julia and Dave wrote a book that was just published by O’Reilly called Tidy Text Mining with R.”
Julia: “So the book is about how to approach text mining using these principles. It’s about some theoretical things from text mining, so there’s some more general natural language processing things but with kind of an opinionated take on them, like how would you approach them if you’re using tidy data structures to organize your data. And then about half of the book is these case studies, so analyses with real-life, real-world messy text data. What do you do from reading it to getting to a result at the end?”
Ginette: “Julia has used tidytext to answer some interesting questions.”
Julia: “One that I worked on that was really fun was about song lyrics. For the last 50 years or so of pop songs, we have all these lyrics, so all this text data, and I wanted to ask the question: what places are mentioned more or less often in these pop songs? This actually came out of a discussion that I’d been having for years with my husband. He’d be sitting listening to a song, and we’d be like, ‘Why is Baltimore mentioned so often?’ Or you’d be like, ‘Wow, there sure are a lot of songs about California.’ It’s just one of these conversations we’ve literally had for years. One day, I was like, you know, I actually have the ability to measure this quantitatively, so let’s do this. So I took the raw song lyrics and transformed them to a tidy data structure, and after you have them in a tidy data structure, you can use some SQL-style joins and things to look for where the names of the places are.
“So I actually started out doing this for the states and looked for the names of states. My tool of choice for visualizations is ggplot2, so I made some maps with ggplot2 and first looked at what states are just mentioned most often by number.”
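The workflow Julia describes (tidy the lyrics, join against a list of place names, count the matches) could be sketched roughly as follows. This is not her actual analysis, which was done in R with tidytext and ggplot2. It is a plain Python analogue under stated assumptions: the lyrics, the three-state list, and the helpers `tidy_ngrams` and `inner_join` are hypothetical, and using two-word n-grams to catch names like "New York" is an assumption of this sketch. The mapping step is left out.

```python
import re
from collections import Counter

# Hypothetical inputs, invented for illustration only.
lyrics = [
    ("Song A", "Sweet home Alabama, where the skies are so blue"),
    ("Song B", "California girls and a New York state of mind"),
    ("Song C", "Back to California, California here I come"),
]
# A short, partial list of state names; a real analysis would use all fifty.
states = {"alabama", "california", "new york"}

def tidy_ngrams(lyrics, max_n=2):
    """Return one row per n-gram (n = 1..max_n) as a (song, ngram) tuple."""
    rows = []
    for song, text in lyrics:
        words = re.findall(r"[a-z']+", text.lower())
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                rows.append((song, " ".join(words[i:i + n])))
    return rows

def inner_join(rows, names):
    """Keep only rows whose n-gram is in `names`, like a SQL inner join."""
    return [(song, ngram) for song, ngram in rows if ngram in names]

matches = inner_join(tidy_ngrams(lyrics), states)

# Total mentions per state, and the number of distinct songs naming each state.
mentions_by_state = Counter(state for _, state in matches)
songs_by_state = Counter(state for _, state in set(matches))
```

Counting distinct songs separately from total mentions is a design choice of this sketch: a single song that repeats a state name in its chorus would otherwise outweigh several songs that each mention it once.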