Students from the UCSB Data Science Club were recently invited to HG Insights headquarters to participate in an eight-hour hackathon designed to help them apply their skills to real-world problems. The event brought together three teams, each composed of UCSB students and experienced members of the HG Insights engineering team, in a friendly competition against the clock. Given the time crunch, the goal wasn’t so much to win as to give the students an opportunity to test their theoretical knowledge of data science through hands-on experimentation, building solutions with large data sets from Yelp, Reddit, and the US federal government. While developing their projects, students also had the opportunity to use cutting-edge technology such as Amazon Web Services (including Elastic MapReduce), Hadoop, Apache Zeppelin, and Apache Spark.
Data scientists from HG Insights kicked off the event with a hands-on tutorial and technology talk, followed by project hacking. The festivities also included plenty of laughs, strong shots of espresso, and massive amounts of street tacos.
Below is a brief summary of the teams, their projects, and some key takeaways:
The first team, named “DeltaPsi,” used data from the GDELT project (see http://gdeltproject.org/) to try to identify fluctuations in the data before, during, and after the 2008 financial crisis. They studied signals available in the event data, creating positive and negative sentiment markers from events such as requests for financial aid or economic austerity. One interesting observation when breaking out the sentiment analysis by country was a huge spike in sentiment specific to Russia. The team is looking forward to examining whether the spike was an anomaly or was triggered by a particular event.
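The country-level breakdown described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the record layout (country code, year, tone score), the sample values, and the function names `mean_tone_by_country_year` and `largest_yearly_change` are all assumptions made up for this example, and real GDELT event records carry many more fields.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (country code, year, tone score).
# The values are invented for illustration and are not real GDELT data.
events = [
    ("RUS", 2007, -1.0),
    ("RUS", 2007, -2.0),
    ("RUS", 2008, -6.0),
    ("USA", 2007, -1.0),
    ("USA", 2008, -2.0),
]


def mean_tone_by_country_year(records):
    """Average the tone scores for each (country, year) pair."""
    grouped = defaultdict(list)
    for country, year, tone in records:
        grouped[(country, year)].append(tone)
    return {key: mean(values) for key, values in grouped.items()}


def largest_yearly_change(means):
    """Return (country, year, change) for the biggest absolute
    year-over-year change in mean tone, or None if no pair of
    consecutive years exists for any country."""
    best = None
    for (country, year), value in means.items():
        previous = means.get((country, year - 1))
        if previous is None:
            continue
        change = value - previous
        if best is None or abs(change) > abs(best[2]):
            best = (country, year, change)
    return best


means = mean_tone_by_country_year(events)
print(largest_yearly_change(means))
```

A large change flagged this way only says where to look. Deciding whether it is an anomaly or the trace of a specific event still requires going back to the underlying records.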
The second team, named “HG Daddies,” wanted to use Natural Language Processing (NLP) to try to predict Yelp review ratings by analyzing the text alone. The team ran out of time before the NLP environment was properly set up, so they never got to test their model, but they did learn quite a bit about setting up AWS Elastic MapReduce (EMR) clusters, working with Amazon’s Simple Storage Service (S3), and loading big data sets. By their own account, they had a blast and learned a ton in a short period of time. (They also ate a lot of tacos.)
The third team, named “Kolmogorov-Smirnoff Ice” (a play on the famous Kolmogorov-Smirnov test, a statistical test used to determine whether two samples differ significantly in their distributions; see https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test), attempted to load Reddit data to explore data mining techniques using Apache Zeppelin and Apache Spark. They used machine learning to train a model based on Reddit’s controversiality metric to try to predict controversiality for posts. They then tokenized the data in the hope of experimenting with NLP techniques such as computing word frequencies, but ran out of time.
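For readers curious about the test behind the team name, here is a minimal sketch of the two-sample statistic it is built on: the largest gap between the empirical cumulative distribution functions of two samples. The function name `ks_statistic` and the sample values are assumptions for illustration only. This computes the statistic alone, not a p-value, and it is not something the team built during the event.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs are step functions that only change at observed
    # values, so checking every observed value is sufficient.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


print(ks_statistic([1, 2, 3], [1, 2, 3]))        # identical samples: 0.0
print(ks_statistic([1, 2, 3], [4, 5, 6]))        # no overlap: 1.0
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # partial overlap: 0.5
```

A value near 0 means the two samples look alike, and a value near 1 means they barely overlap. Turning the statistic into a significance decision additionally depends on the sample sizes.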
At the end of the night, each team presented its project and offered insights into what it had learned.
After HG Insights gave a data science presentation on campus in late 2016, UCSB Data Science Club founder and president Jason Freeberg and HG Insights CTO Rob Fox agreed that a hackathon hosted at HG Insights HQ in downtown Santa Barbara would give the members of the club a real opportunity to work shoulder-to-shoulder with engineers in the field, providing a truly enriching experience.
“It’s important to the culture of HG Insights to give back to the community and help foster the minds of our next generation of data scientists who will someday be shaping the very foundation of the world around us,” said Rob Fox. “Plus the tacos were really good.”