The latest in data science and machine learning from KDD 2017


KDD 2017 is short for the 23rd Conference on Knowledge Discovery and Data Mining, organized by the Association for Computing Machinery (ACM) under its Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Konica Minolta Labs Europe attended the conference in Halifax, Canada, from 13th to 17th August, together with more than 1600 experts from academia and industry all over the world.

KDD is a major conference on data science, data mining, knowledge discovery, large-scale data analytics and big data. Although the conference has a strong academic component, it brings together researchers and practitioners from many different companies and industries.

The programme was rich in interesting topics, but my main focus was on the conventional and hands-on tutorials introducing algorithms that are mature and robust enough to be applied in the projects we develop in our Artificial Intelligence and Smart Data Systems research area.

I was curious about the maturity of the solutions proposed at the conference, and I was impressed by the first tutorial I attended: ‘Time Series Data Mining Using the Matrix Profile’, organized by Eamonn Keogh from the University of California, Riverside and Abdullah Mueen from the University of New Mexico.

The tutorial presented a unifying view of Motif Discovery, Anomaly Detection, Segmentation, Classification, Clustering, and Similarity Joins over time-series data, all built on the Matrix Profile. The first part demonstrated how the Matrix Profile is used and how it copes with time-series data from different domains; the second part focused on the algorithmic background of its computation. Judging by the results shown in the tutorial, the Matrix Profile could be used both in academia, as a time-series analysis tool that fosters further research directions, and in commercial prototyping, where it can be tuned for specific domains. My colleagues and I will investigate how to apply this methodology in our future projects, for instance Cognitive Hub.

Red line: an artificially created time-series of white noise with inserted sub-sequences that are similar but not identical. The blue time-series depicts the calculated Matrix Profile, which signals Motifs at the ticks with the lowest values. The numbers and arcs at the bottom of the figure indicate where in the examined time-series the other most similar Motif occurs [A. Mueen, E. Keogh: Tutorial on Time Series Data Mining Using the Matrix Profile. KDD 2017. Online: accessed 2017-08-13]
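
To make the idea behind the figure more concrete, here is a minimal sketch in Python with NumPy. It is a naive brute-force version written for illustration only, not the algorithm presented in the tutorial. The function names (`znorm`, `matrix_profile`), the exclusion zone of half a window, and the synthetic signal (white noise with two noisy copies of a sine pattern) are all my own assumptions.

```python
import numpy as np

def znorm(x):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    s = x.std()
    return (x - x.mean()) / s if s > 0 else np.zeros_like(x)

def matrix_profile(ts, m):
    """Brute-force matrix profile (illustrative, quadratic in the series length).

    For every subsequence of length m, returns the z-normalized Euclidean
    distance to its nearest non-overlapping neighbour, plus that neighbour's
    start index.
    """
    ts = np.asarray(ts, dtype=float)
    n = len(ts) - m + 1
    subs = np.array([znorm(ts[i:i + m]) for i in range(n)])
    excl = m // 2  # assumed exclusion zone that skips trivial self-matches
    profile = np.full(n, np.inf)
    index = np.full(n, -1)
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        d[max(0, i - excl):min(n, i + excl + 1)] = np.inf
        j = int(np.argmin(d))
        profile[i], index[i] = d[j], j
    return profile, index

# Synthetic data (an assumption): white noise with two similar,
# but not identical, copies of a sine pattern at positions 60 and 260.
rng = np.random.default_rng(0)
ts = rng.normal(size=400)
pattern = 3 * np.sin(np.linspace(0, 4 * np.pi, 40))
ts[60:100] = pattern + rng.normal(scale=0.1, size=40)
ts[260:300] = pattern + rng.normal(scale=0.1, size=40)

profile, index = matrix_profile(ts, m=40)
motif = int(np.argmin(profile))   # start of the best-matching subsequence
partner = int(index[motif])       # start of its nearest neighbour
print("motif starts near", motif, "and", partner)
```

The lowest values of `profile` mark the Motifs, as in the blue curve of the figure, while unusually high values would point to candidate anomalies. For series of realistic length, a dedicated implementation would be preferable to this quadratic loop.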

During KDD, I also took part in two workshops: one dedicated to Anomaly Detection in Finance, and ‘Big Data, IoT Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications’, also known as KDD BigMine17. The BigMine17 workshop introduced a deep learning approach to anomaly detection in time-series, proposing new kinds of Long Short-Term Memory (LSTM) networks that enable deep learning with delayed prediction.

An interesting paper, presented by Jaroslav Kuchar from the Czech Technical University in Prague and Vojtech Svatek from the University of Economics in Prague, dealt with searching for anomalies in EU funds by leveraging frequent patterns. Anodot, a company with which Konica Minolta is collaborating, demonstrated at KDD a solution for anomaly detection using an improved ARIMA model that takes the seasonality of the time-series data into account.

One of the most interesting talks was a keynote speech by Stanford University Professor Jure Leskovec, who outlined current state-of-the-art solutions for Mining Online Networks and Communities. I was particularly interested in mining graphs that evolve over time: Jure illustrated this topic with an example of a graph correlating multiple car sensors and showed how the graph changes over time.

Among the hands-on tutorials I attended, the most interesting was ‘Using R for Scalable Data Science: Single Machines to Hadoop Spark Clusters’. Microsoft has made available open-source R packages for running R scripts inside MS SQL Server 2016+, so data scientists no longer need to download datasets from an SQL server to their workstation, which causes higher network load and security vulnerabilities. Instead, they can leverage R processing inside the MSSQL database engine and send algorithms to the data: for instance, they can run data transformation tasks on the server and download only the results.

The conference on Knowledge Discovery and Data Mining gave me great opportunities to talk about emerging problems, discuss current trends and exchange experiences in data science and machine learning. I am looking forward to taking part in the next KDD conference in London in 2018.