Telekinesis - Data to Data Conversations

Streaming with StreamSets

In this episode we chat with Pat Patterson, StreamSets community champion!

We have another podcast in the pocket, and we are excited to announce that it now has an official name—Telekinesis: Data to Data Conversations. We’re having a lot of fun with this, so be sure to check back regularly to hear our loosely-guided discussions and interviews with all sorts of industry folks on technology innovations that are revolutionizing the data and analytics ecosystem.

In this episode of the Telekinesis podcast, Kinetica CMO Daniel Raskin has the pleasure of interviewing Pat Patterson, the Community Champion for StreamSets. An outgoing personality and self-described “articulate techie,” Pat has been working with Internet technologies since 1997, building software and communities at Sun Microsystems, Huawei, Salesforce, and StreamSets. At Sun, Pat was the community lead for the OpenSSO open source project, while at Huawei he developed cloud storage infrastructure software. As a developer evangelist at Salesforce, Pat focused on identity, integration, and the Internet of Things. Now community champion at StreamSets, Pat is responsible for the care and feeding of the StreamSets open source community. Here’s a brief teaser of our conversation:

Give us a little download about StreamSets. How long has the company been around, and why does it exist? StreamSets was founded in 2014 by Girish Pancha, former chief product officer of Informatica, and Arvind Prabhakar, an early employee and engineering leader at Cloudera. What they realized was that the old world of ETL—which involved building batch jobs to read data out of Oracle and move it to somewhere else—was really breaking down in this world of continuous streaming and big data. Arvind was at Cloudera, and this was a very, very different world from your old enterprise data world, this idea of continuous data arrival. If you think about it, we all want to be working on the very latest data. If you were given the choice between working on data up to five minutes ago and data up to yesterday, what are you going to choose? So Arvind and Girish founded StreamSets in 2014, and I came on in March 2016 as Community Champion. Over time, we built up our customer base, and now we have great customers like GlaxoSmithKline and Western Union.

We’ve built out our product line. From our StreamSets Data Collector core product, which is our open source software for development of data pipelines, we’ve built out a control layer, the StreamSets Control Hub, which gives you a repository for your data pipelines, and lets you control and monitor your data collectors running those pipelines. We also have the StreamSets Dataflow Performance Manager, which gives you stats and metrics on data flow across your organization, and lets you configure SLAs for a range of requirements. We also have StreamSets Data Collector Edge, which is another open source project that allows you to deploy these pipelines onto constrained devices. We’re talking about the IoT, ARM-based systems, Raspberry Pi devices, etc., so you can put a pipeline right there on the edge, and start gathering and filtering data and passing it up to the core for more processing. So really, our mission is to build this data operations platform to give you a way to get a handle on all the data that’s in motion in your enterprise.

What are some of the coolest business applications of this that you’re seeing? One of the most interesting is Cox Automotive. They have about 30 subsidiary companies—Kelley Blue Book and Autotrader are the best known—and they have a whole bunch of others that are more special-purpose for dealer networks and checking on vehicle identification numbers, etc. You can imagine that across that group, there are many opportunities for analyzing the data if you can join it all together. If you can bring in all the info from Kelley Blue Book, Autotrader, and so on, you can get a picture of the market for 2012 Camrys, for example. The company had originally set up point-to-point integrations between different subsidiaries. If Autotrader wanted Kelley Blue Book’s data, they would just set up a point-to-point synchronization, and then if VINSolutions wanted some of that data, they would set it up and start getting Autotrader’s data from Kelley Blue Book. The system had a lot of ad-hoc integrations and hand-coded solutions that were very, very brittle. So they decided to build a data lake, where you can bring all this together into one platform for analysis.

They ended up coming on board with StreamSets, and now they have pipelines running at each of their subsidiaries that gather all the data. For example, at Autotrader, one of their Oracle schemas has 1500 tables, and they’re literally pulling everything out of those on an ongoing basis and attaching some standard metadata, so each record that comes up the pipe says when it was ingested, where it was ingested from, what table, and a number of other standard fields. Each record can be different depending on where it’s come from, but it’s all flowing through the same pipelines to a central system that is putting it into Apache Hive.
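That per-record enrichment step can be sketched in a few lines. This is a minimal illustration, not StreamSets code; the field names (`_ingest_ts`, `_source_system`, `_source_table`) are hypothetical stand-ins for whatever standard metadata the actual pipelines attach:

```python
from datetime import datetime, timezone

def enrich(record, source_system, table_name):
    """Attach standard ingestion metadata to a record.

    Field names here are hypothetical; a real pipeline defines its own
    metadata conventions.
    """
    record = dict(record)  # copy, so we don't mutate the caller's record
    record["_ingest_ts"] = datetime.now(timezone.utc).isoformat()
    record["_source_system"] = source_system
    record["_source_table"] = table_name
    return record

row = {"vin": "1HGCM82633A004352", "price": 4200}
print(enrich(row, "autotrader_oracle", "LISTINGS"))
```

Because every record carries the same standard fields regardless of shape, heterogeneous records from 1500 different tables can all flow through the same central pipeline.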

Part of their central pipeline looks at the structure of each record as it’s coming in; it looks at the Hive schema, and then reconciles the two. So it’ll actually create Hive tables or alter Hive tables on-the-fly as data is arriving. It used to be a real problem where they would spend four hours figuring out, “This piece of data, I know it’s being created over there, but it’s not arriving over here,” so somebody would have to go down, diagnose it, and change some hand-coded script or app to handle it. Now, it just flows all day long.
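The core of that schema-drift reconciliation can be sketched as follows. This is a simplified illustration under assumptions (a flat record, a crude type mapping, a hypothetical `listings` table); real schema-drift handling also infers richer types and deals with nested structures:

```python
def reconcile_schema(hive_columns, record):
    """Compare an incoming record's fields against the known Hive columns
    and emit ALTER TABLE DDL for any new fields (simplified sketch)."""
    ddl = []
    for field, value in record.items():
        if field not in hive_columns:
            # crude type inference for illustration only
            hive_type = "BIGINT" if isinstance(value, int) else "STRING"
            ddl.append(f"ALTER TABLE listings ADD COLUMNS ({field} {hive_type})")
            hive_columns.add(field)
    return ddl

known = {"vin", "price"}
stmts = reconcile_schema(known, {"vin": "1HG...", "price": 4200, "color": "red"})
print(stmts)  # one ALTER TABLE adding the new `color` column
```

The point is that the pipeline, not a human with a hand-coded script, notices the new field and evolves the Hive table as the data arrives.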

How can StreamSets provide a performance-standard way to move data from HDFS into Kinetica? One of our customers is the Australian Department of Defence. We literally know nothing about their use case. They file a ticket and say something is wrong, and we ask, “What are you processing? What data are you working with?” and they can’t tell us.

However, one of the things they’re very open about is the technologies that they’re using, and the performance that they’re getting from them. One of their guys was answering a question about pulling data from HDFS, and he said, “We had this dataset, it was two billion rows in HDFS, and I built a pipeline to pull the data, derive some new fields from existing fields, run some Python against each record to do some more complicated processing, assign a UUID to each record, and then write it to a database.” They happened to be using Apache Kudu. He said it took 90 minutes to ingest a billion rows from HDFS, and that’s because when you run your pipeline on Hadoop, it actually runs on every data node, so instead of one pipeline reading data across one connection out of HDFS, you’ve actually got “n” pipelines, where “n” is your number of data nodes. With StreamSets, they were able to get 200k recs/sec from HDFS to Apache Kudu, enriching and transforming the data along the way. I don’t know how many data nodes the Australian Department of Defence has in their Hadoop cluster, but you get this massive parallelism, and this guy was just hopping from foot to foot with excitement as he was typing this into Slack, I’m sure.
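As a quick sanity check (a back-of-envelope calculation, not from the show), the two quoted numbers line up: a billion rows at 200k records/sec works out to roughly the 90 minutes he reported.

```python
rows = 1_000_000_000       # a billion rows, as quoted
rate = 200_000             # records per second, as quoted
minutes = rows / rate / 60
print(round(minutes, 1))   # ~83 minutes, consistent with the quoted ~90
```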

What does a “Community Champion” do, exactly? As a community champion, my title reflects the fact that this technology is not just for developers; it’s for data scientists, data engineers, and basically anybody who wants to move data from one place to another. Over the past two years, my role has been to act as a bridge between the product team, the engineering team, and our open source community. I’m the guy that gets to build demos, write blog posts, send tweets, and show up at the conferences. In the past, I could write a screen full of code, write a blog entry about it, and then walk away, which is a luxury for any developer, since you don’t have to support your code.

But at StreamSets, I’ve actually written the Salesforce integration, so no piece of knowledge goes “unpunished” at a startup, and I actually have my hands in the code a little bit more. As the company grows and matures, I’m starting to talk more about benefits rather than features, and solutions rather than technologies, as we are naturally talking with bigger and bigger customers. So it’s an interesting role, it’s a changing role, but it’s always one foot in the technology and then one foot in communicating that to our audience.

As a community champion, I would imagine one of the big things you have to do is figure out how to make your technology stand out from the crowd?

Right. And the answer is Minecraft! It actually goes back to my time at Salesforce, and one of our community members on a Salesforce chat channel said, “It’d be great if you could see Salesforce in Minecraft.” I got to thinking about that, and Neuromancer, the old William Gibson cyberpunk novel, and about representing the real world in the virtual world. So I created a Minecraft world, where every account in Salesforce is a building, and every deal that you’ve got in progress with that customer is a floor, and then there are levers on the walls for the opportunity stages, prospecting, negotiating, etc. The lever shows the stage and also gives you the ability to pull it.

I implemented this and demoed it to my boss and showed him, okay, if I close the opportunity in Salesforce, in the traditional web interface, I see the lever flip, and then a villager appears who’s one of your Salesforce contacts, and drops you gold and gems, so that’s the value of that opportunity. My boss said, “This is the best thing ever. Tidy this up, write a blog post, make a video for YouTube, and let’s get this out, because this is awesome.” You can really see the power of APIs. Minecraft can talk to Salesforce—anything can talk to Salesforce. I showed it at a Dreamforce session, and I also presented it at a Salesforce executive summit in Vegas to a room of 700 Salesforce VPs and Directors.

That was the origin of the Minecraft project, and then when I arrived at StreamSets, we wanted some fun ways to visualize data, and I did one where you were looking at Apache log data, and I had a map of the world that had blocks dropping in over time, doing the GeoIP lookup, IP address, and the latitude/longitude, so you’d almost get a two-dimensional histogram of the world, with piles of blocks where you had the most requests.
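The "two-dimensional histogram of the world" boils down to bucketing each request's latitude/longitude into a coarse grid cell and counting hits per cell. Here is a minimal sketch under assumptions: the `(lat, lon)` pairs stand in for the output of a GeoIP lookup, and the 10-degree cell size is arbitrary.

```python
from collections import Counter

def bin_coord(lat, lon, cell_deg=10):
    """Bucket a latitude/longitude into a coarse grid cell (cell size is arbitrary)."""
    return (int(lat // cell_deg), int(lon // cell_deg))

# hypothetical (lat, lon) pairs such as a GeoIP lookup might return
hits = [(51.5, -0.1), (51.4, -0.2), (37.77, -122.4)]
histogram = Counter(bin_coord(lat, lon) for lat, lon in hits)
print(histogram)  # count per grid cell = pile height of blocks on the map
```

In the demo, each cell's count becomes the height of the pile of blocks dropped onto that spot on the Minecraft map.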

Our marketing guy heard about this and said, “Hmm, could you do something with Star Wars, because May the Fourth is coming up, you know, Star Wars Day—May the Fourth Be with You? How about looking at Twitter for mentions of different characters?”

So I built a pipeline to read the Twitter stream on #StarWars, and then parsed out different character names. I set up simple regular expressions that say, “If it says Vader, Darth Vader, or whatever, put a tick in the Vader box. If it says Han or Han Solo or Solo, put a tick in the Han Solo box,” and so on. I collected two weeks’ worth of Tweets and built a little system where I could read off those mentions and build the characters’ portraits in a Minecraft world where it was like a map of Tatooine.
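The character-matching step can be sketched with a couple of regular expressions. The alias lists here are illustrative guesses, not the show's actual patterns:

```python
import re
from collections import Counter

# hypothetical alias patterns; the real pipeline used similarly simple regexes
PATTERNS = {
    "Vader": re.compile(r"\b(darth\s+vader|vader)\b", re.IGNORECASE),
    "Han Solo": re.compile(r"\b(han\s+solo|han|solo)\b", re.IGNORECASE),
}

def tally(tweets):
    """Put a 'tick in the box' for each character mentioned in each tweet."""
    counts = Counter()
    for text in tweets:
        for character, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[character] += 1
    return counts

print(tally(["Vader is the best!", "han solo forever", "I am your father"]))
```

Each character's running count then drives how much of that character's block portrait gets built in the Minecraft world.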

Hope you enjoy listening! Here are some additional links to the demos we talked about on the show:

“Forcecraft” Demo – Salesforce and Minecraft

Visualizing Apache Log Data – StreamSets and Minecraft

May the 4th Be With You – Star Wars Tweets

How Pat Built the Star Wars Tweet Demo
