Rittman Mead Consulting

Rittman Mead consults, trains, and innovates within the world of Oracle Business Intelligence, data integration, and analytics.

ChitChat for OBIEE - Now Available as Open Source!

Fri, 2018-06-15 03:20

ChitChat is the Rittman Mead commentary tool for OBIEE. ChitChat enhances the BI experience by bridging conversational capabilities into the BI dashboard, increasing ease-of-use and seamlessly joining current workflows. From tracking the history behind analytical results to commenting on specific reports, ChitChat provides a multi-tiered platform built into the BI dashboard that creates a more collaborative and dynamic environment for discussion.

Today we're pleased to announce the open-source release of ChitChat! You can find the GitHub repository here: https://github.com/RittmanMead/ChitChat

Highlights of the features that ChitChat provides include:

  • Annotate - ChitChat's multi-tiered annotation capabilities allow BI users to leave comments where they belong, at the source of the conversation inside the BI ecosystem.

  • Document - ChitChat introduces the ability to include documentation inside your BI environment for when you need more than a comment. Keeping key materials contained inside the dashboard gives the right people access to key information without searching.

  • Share - ChitChat allows you to bring attention to important information on the dashboard using the channel or workflow manager you prefer.

  • Verified Compatibility - ChitChat has been tested against popular browsers, operating systems, and database platforms for maximum compatibility.

Getting Started

In order to use ChitChat you will need a supported version of OBIEE, such as 12.2.1.x.

First, download the application and unzip it to a convenient location on the OBIEE server, such as a home directory or the desktop.

See the Installation Guide for full detail on how to install ChitChat.

Database Setup

Build the required database tables using the installer:

cd /home/federico/ChitChatInstaller  
java -jar SocializeInstaller.jar -Method:BuildDatabase -DatabasePath:/app/oracle/oradata/ORCLDB/ORCLPDB1/ -JDBC:"jdbc:oracle:thin:@" -DatabaseUser:"sys as sysdba" -DatabasePassword:password -NewDBUserPassword:password1  

The installer will create a new user (RMREP) and the tables required for the application to operate correctly. The flags are:

  • -DatabasePath - tells the installer where to place the datafiles for ChitChat on your database server.

  • -JDBC - indicates which JDBC driver to use, followed by a colon and the JDBC string to connect to your database.

  • -DatabaseUser - specifies the user to access the database with.

  • -DatabasePassword - specifies the password for the user given above.

  • -NewDBUserPassword - indicates the password for the new user (RMREP) being created.

WebLogic Data Source Setup

Add a Data Source object to WebLogic using WLST:

cd /home/federico/ChitChatInstaller/jndiInstaller  
$ORACLE_HOME/oracle_common/common/bin/wlst.sh ./create-ds.py

To use this script, modify the ds.properties file using the method of your choice. The following parameters must be updated to reflect your installation: domain.name, admin.url, admin.userName, admin.password, datasource.target, datasource.url and datasource.password.
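For illustration, a completed ds.properties might look like the following. The property names are those listed above; all of the values shown are placeholders for a hypothetical local installation and must be replaced with your own:

```properties
# Hypothetical example values - replace with your own installation details
domain.name=bi_domain
admin.url=t3://localhost:9500
admin.userName=weblogic
admin.password=Password01
datasource.target=bi_server1
datasource.url=jdbc:oracle:thin:@localhost:1521/ORCLPDB1
datasource.password=password1
```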

Deploying the Application on WebLogic

Deploy the application to WebLogic using WLST:

cd /home/federico/ChitChatInstaller  
$ORACLE_HOME/oracle_common/common/bin/wlst.sh ./deploySocialize.py

To use this script, modify the deploySocialize.py file using the method of your choice. The first line must be updated with the username, password and URL to connect to your WebLogic Server instance. The second parameter of the deploy command must be updated to reflect your ChitChat access location.

Configuring the Application

ChitChat requires several configuration parameters to allow the application to operate successfully. To change the configuration, you must log in to the database schema as the RMREP user, and update the values manually in the APPLICATION_CONSTANT table.

See the Installation Guide for full detail on the available configuration and integration options.

Enabling the Application

To use ChitChat, you must add a small block of code to any dashboard (in a new column on the right-hand side of the dashboard) where you want the application enabled:

<rm id="socializePageParams"></rm>
<script src="/Socialize/js/dashboard.js"></script>

Congratulations! You have successfully installed the Rittman Mead commentary tool. To use the application to its fullest capabilities, please refer to the User Guide.


Please raise any issues on the GitHub issue tracker. This is open source, so bear in mind that it's no-one's "job" to maintain the code - it's open to the community to use, benefit from, and maintain.

If you'd like specific help with an implementation, Rittman Mead would be delighted to assist - please do get in touch with Jon Mead or DM us on Twitter @rittmanmead to get access to our Slack channel for support about ChitChat.

Please contact us on the same channels to request a demo.

Categories: BI & Warehousing

Real-time Sailing Yacht Performance - Kafka (Part 2)

Thu, 2018-06-14 08:18

In the last two blogs Getting Started (Part 1) and Stepping back a bit (Part 1.1) I looked at what data I could source from the boat's instrumentation and introduced some new hardware to the boat to support the analysis.

Just to recap, I am looking to create the yacht's Polars with a view to improving our knowledge of her abilities (whether we can use this to improve our race performance is another matter).

Polars give us a plot of the boat's speed given a true wind speed and angle. This, in turn, informs us of the optimal speed the boat could achieve at any particular angle to wind and wind speed.

Image Description

In the first blog I wrote a reader in Python that takes messages from a TCP/IP feed and writes the data to a file. The reader validates each message using a hash key (see Getting Started (Part 1)). I'm also converting valid messages into JSON format so that I can push meaningful structured data downstream. In this blog, I'll cover the architecture and considerations around the setup of Kafka for this use case. I will not cover the installation of each component, as there has been a lot written in this area (we have some internal IP to help with configuration). I will discuss the process I went through to get the data displayed in real time in a Grafana dashboard.
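By way of illustration, here is a minimal sketch of that validation step. Standard NMEA 0183 sentences carry a two-character hex checksum after the '*', computed as the XOR of everything between '$' and '*' - the function names below are mine for this sketch, not the exact code from Part 1:

```python
def nmea_checksum(sentence):
    # XOR together every character between the leading '$' and the '*'
    payload = sentence[1:sentence.index('*')]
    checksum = 0
    for ch in payload:
        checksum ^= ord(ch)
    # NMEA checksums are written as two uppercase hex characters
    return "%02X" % checksum

def is_message_valid(orig_line):
    line = orig_line.strip()
    # A message must start with '$' and end with '*' plus a hex checksum
    if not line.startswith('$') or '*' not in line:
        return False
    return nmea_checksum(line) == line.split('*')[-1].upper()
```

Messages that arrive corrupted over the wireless link fail the comparison and can be counted as invalid rather than pushed downstream.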

Introducing Kafka

I have introduced Kafka into the architecture as a next step.

Why Kafka?

I would like to be able to stream this data in real time and don't want to build my own batch mechanism or create a publish/subscribe model. With Kafka I don't need to check that messages have been successfully received, and if there is a failure while consuming messages the consumers will keep track of what has been consumed. If a consumer fails it can be restarted and it will pick up where it left off (the consumer offset is stored in Kafka as a topic). In the future, I could scale out the platform and introduce some resilience through clustering and replication (this shouldn't be required for a while). Kafka therefore is saving me a lot of manual engineering and will support future growth (should I come into money and be able to afford more sensors for the boat).

High level architecture

Let's look at the high-level components and how they fit together. Firstly I have the instruments transmitting on wireless TCP/IP, and these messages are read using the Python reader I wrote earlier in the year.

I have enhanced the Python to read and translate the messages and, instead of writing to a file, I stream the JSON messages to a topic in Kafka.

Once the messages are in Kafka I use Kafka Connect to stream the data into InfluxDB. The messages are written to topic-specific measurements (tables in InfluxDB).

Grafana is used to display incoming messages in real time.

Kafka components

I am running the application on a MacBook Pro - basically a single-node instance with ZooKeeper, a Kafka broker and a Kafka Connect worker. This is the minimum setup with very little resilience.

Image Description

In summary

ZooKeeper is an open-source server that enables distributed coordination of configuration information. In the Kafka architecture ZooKeeper stores metadata about brokers, topics, partitions and their locations. ZooKeeper is configured in zookeeper.properties.

Kafka broker is a single Kafka server.

"The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk. It also services consumers, responding to fetch requests for partitions and responding with the messages that have been committed to disk." 1

The broker is configured in server.properties. In this setup I have set auto.create.topics.enable=false. Setting this to false gives me control over the environment; as the name suggests, it disables the automatic creation of topics, which in turn could lead to confusion.

The Kafka Connect worker allows us to take advantage of predefined connectors that enable the writing of messages from Kafka to known external datastores. The worker is a wrapper around a Kafka consumer. A consumer is able to read messages from a topic partition using offsets; offsets keep track of what has been read by a particular consumer or consumer group. (Kafka Connect workers can also write to Kafka from datastores, but I am not using this functionality in this instance.) The connect worker is configured in connect-distributed.properties. I have defined the location of the plugins in this configuration file. Connector definitions are used to determine how to write to an external data source.

Producer to InfluxDB

I use kafka-python to stream the messages into Kafka. Within kafka-python there is a KafkaProducer that is intended to work in a similar way to the official Java client.

I have created a producer for each message type (parameterised code). Although each producer reads the entire stream from the TCP/IP port, it only processes its assigned message type (wind or speed), increasing parallelism and therefore throughput.

  producer = KafkaProducer(bootstrap_servers='localhost:9092', value_serializer=lambda v: json.dumps(v).encode('utf-8'))
  producer.send(topic, json_str)

I have created a topic per message type with a single partition. Using a single partition per topic guarantees I will consume messages in the order they arrive. There are other ways to increase the number of partitions and still maintain read order, but for this use case a topic per message type seemed to make sense. I have basically optimised for throughput (well enough for the number of messages I am trying to process).

kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wind-json

kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic speed-json

kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic gps-json 

When defining a topic you specify the replication-factor and the number of partitions.

The topic-level configuration is replication.factor. At the broker level, you control the default.replication.factor for automatically created topics. 1 (I have turned off the automatic creation of topics).

The messages are consumed using Stream Reactor, which has an InfluxDB sink and writes directly to the measurements within a performance database I have created. The following parameters, showing the topics and insert mechanism, are configured in performance.influxdb-sink.properties.


connect.influx.kcql=INSERT INTO wind SELECT * FROM wind-json WITHTIMESTAMP sys_time();INSERT INTO speed SELECT * FROM speed-json WITHTIMESTAMP sys_time();INSERT INTO gps SELECT * FROM gps-json WITHTIMESTAMP sys_time()
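For context, the KCQL line above sits alongside the sink's connection settings in the same properties file. A hedged sketch of what that file might look like follows - the connection values are placeholders, and the connector class and connect.influx.* property names should be checked against the Stream Reactor version you install:

```properties
# Placeholder values - verify against your Stream Reactor release
name=performance.influxdb-sink
connector.class=com.datamountaineer.streamreactor.connect.influx.InfluxSinkConnector
topics=wind-json,speed-json,gps-json
connect.influx.url=http://localhost:8086
connect.influx.db=performance
connect.influx.kcql=INSERT INTO wind SELECT * FROM wind-json WITHTIMESTAMP sys_time();INSERT INTO speed SELECT * FROM speed-json WITHTIMESTAMP sys_time();INSERT INTO gps SELECT * FROM gps-json WITHTIMESTAMP sys_time()
```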

The following diagram shows the detail from producer to InfluxDB.

If we now run the producers we get data streaming through the platform.

Producer Python log showing JSON formatted messages:

The status of the consumers shows minor lag reading from two topics; the describe also shows the current offsets for each consumer task and the partitions being consumed (if we had a cluster it would show multiple hosts):

Inspecting the InfluxDB measurements:

When inserting into a measurement in InfluxDB, if the measurement does not exist it is created automatically. The datatypes of the fields are determined from the JSON object being inserted. I needed to adjust the creation of the JSON message to cast the values to floats, otherwise I ended up with the wrong types, which caused reporting issues in Grafana. This would be a good case for using Avro and a Schema Registry to handle these definitions.
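A minimal sketch of the kind of adjustment involved (the field names here are illustrative, not the exact message schema from the boat): casting values to float before serialising ensures InfluxDB infers float field types on the first insert.

```python
import json

def build_wind_message(raw_fields):
    # Cast numeric values to float before serialising; InfluxDB fixes a
    # measurement's field types from the first point inserted, so sending
    # "12.3" (string) or 45 (int) first would lock in the wrong type.
    return json.dumps({
        "wind_speed": float(raw_fields["wind_speed"]),
        "wind_angle": float(raw_fields["wind_angle"]),
    })
```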

The following gif shows Grafana displaying some of the wind and speed measurements using a D3 Gauge plugin with the producers running to the right of the dials.

Next Steps

I'm now ready to do some real-life testing on our next sailing passage.

In the next blog, I will look at making the setup more resilient to failure and how to monitor and automatically recover from some of these failures. I will also introduce the WorldMap panel to Grafana so I can plot the locations the readings were taken at and overlay tidal data.

Categories: BI & Warehousing

OAC - Thoughts on Moving to the Cloud

Tue, 2018-06-12 12:43

Last week, I spent a couple of days with Oracle at Thames Valley Park and this presented me with a perfect opportunity to sit down and get to grips with the full extent of the Oracle Analytics Cloud (OAC) suite...without having to worry about client requirements or project deadlines!

As a company, Rittman Mead already has solid experience of OAC, but my personal exposure has been limited to presentations, product demonstrations, reading the various postings in the blog community and my existing experiences of Data Visualisation and BI cloud services (DVCS and BICS respectively). You’ll find Francesco’s post a good starting place if you need an overview of OAC and how it differs (or aligns) to Data Visualisation and BI Cloud Services.

So, having spent some time looking at the overall suite and, more importantly, trying to interpret what it could mean for organisations thinking about making a move to the cloud, here are my top three takeaways:

Clouds Come In Different Shapes and Flavours

Two of the main benefits that a move to the cloud offers are simplification in platform provisioning and an increase in flexibility, being able to ramp up or scale down resources at will. These both come with a potential cost benefit, depending on your given scenario and requirement. The first step is understanding the different options in the OAC licensing and feature matrix.

First, we need to draw a distinction between Analytics Cloud and the Autonomous Analytics Cloud (interestingly, both options point to the same page on cloud.oracle.com, which makes things immediately confusing!). In a nutshell though, the distinction comes down to who takes responsibility for the service management: Autonomous Analytics Cloud is managed by Oracle, whilst Analytics Cloud is managed by yourself. It’s interesting to note that the Autonomous offering is marginally cheaper.

Next, Oracle have chosen to extend their BYOL (Bring Your Own License) option from their IaaS services to now incorporate PaaS services. This means that if you have existing licenses for the on-premise software, then you are able to take advantage of what appears to be a significantly discounted cost. Clearly, this is targeted to incentivise existing Oracle customers to make the leap into the Cloud, and should be considered against your ongoing annual support fees.

Since the start of the year, Analytics Cloud comes in three different versions, with the Standard and Enterprise editions now being separated by the new Data Lake edition. The important things to note are that (possibly confusingly) Essbase is now incorporated into the Data Lake edition of the Autonomous Analytics Cloud and that for the full enterprise capability you have with OBIEE, you will need the Enterprise edition. Each version inherits the functionality of its preceding version: Enterprise edition gives you everything in the Data Lake edition; Data Lake edition incorporates everything in the Standard edition.


Finally, it’s worth noting that OAC aligns to the Universal Credit consumption model, whereby the cost is determined by the size and shape of the cloud that you need. Services can be purchased as Pay as You Go or Monthly Flex options (with differential costing to match). The PAYG model is based on hourly consumption and is paid for in arrears, making it the obvious choice for short-term prototyping or POC activities. Conversely, the Monthly Flex model is paid in advance and requires a minimum 12-month investment, and therefore makes sense for full-scale implementations. Then, the final piece of the jigsaw comes with the shape of the service you consume. This is measured in OCPUs (Oracle Compute Units), and the larger your memory requirements, the more OCPUs you consume.
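As a back-of-the-envelope illustration of that trade-off (the rates below are invented placeholders, not Oracle list prices), the choice between the two models comes down to expected utilisation:

```python
def payg_cost(ocpus, hours, rate_per_ocpu_hour):
    # Pay as You Go: billed in arrears for the hours actually consumed
    return ocpus * hours * rate_per_ocpu_hour

def monthly_flex_cost(ocpus, months, rate_per_ocpu_month):
    # Monthly Flex: paid in advance, with a minimum 12-month commitment
    return ocpus * max(months, 12) * rate_per_ocpu_month

# A short POC on 4 OCPUs, 8 hours a day for 20 days, suits PAYG;
# an always-on, full-scale implementation suits Monthly Flex.
poc_cost = payg_cost(4, 8 * 20, 1.0)
year_cost = monthly_flex_cost(4, 12, 500)
```

Note how the `max(months, 12)` term means a six-month project still pays for twelve months under Monthly Flex, which is exactly why PAYG wins for short-lived prototyping.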

Where You Put Your Data Will Always Matter

Moving your analytics platform into the cloud may make a lot of sense and could therefore be a relatively simple decision to make. However, the question of where your data resides is a more challenging subject, given the sensitivities and increasing legislative constraints that exist around where your data can or should be stored. The answer to that question will influence the performance and data latency you can expect from your analytics platform.

OAC is architected to be flexible when it comes to its data sources and consequently the options available for data access are pretty broad. At a high level, your choices are similar to those you would have when implementing on-premise, namely:

  • perform ELT processing to transform and move the data (into the cloud);
  • replicate data from source to target (in the cloud) or;
  • query data sources via direct access.

These are supplemented by a fourth option to use the inbuilt Data Connectors available in OAC to connect to cloud or on-premise databases, other proprietary platforms or any other source accessible via JDBC. This is probably a decent path for exploratory data usage within DV, but I’m not sure it would always make the best long term option.


Unsurprisingly, with the breadth of options comes a spectrum of tooling that can be used for shifting your data around and it is important to note that depending on your approach, additional cloud services may or may not be required.

For accessing data directly at its source, the preferred route seems to be to use RDC (Remote Data Connector), although it is worth noting that support is limited to Oracle (including OLAP), SQL Server, Teradata or DB2 databases. Also, be aware that RDC operates within WebLogic Server and so this will be needed within the on-premise network.

Data replication is typically achieved using Data Sync (the reincarnation of the DAC, which OBIA implementers will already be familiar with), although it is worth mentioning that there are other routes that could be taken, such as APEX or SQL Developer, depending on the data volumes and latency you have to play with.

Classic ELT processing can be achieved via Oracle Data Integrator (either the Cloud Service, a traditional on-premise implementation or a hybrid-model).

Ultimately, due care and attention needs to be taken when deciding on your data architecture as this will have a fundamental effect on the simplicity with which data can be accessed and interpreted, the query performance achieved and the data latency built into your analytics.

Data Flows Make For Modern Analytics Simplification

A while back, I wrote a post titled Enabling a Modern Analytics Platform in which I attempted to describe ways that Mode 1 (departmental) and Mode 2 (enterprise) analytics could be built out to support each other, as opposed to undermining one another. One of the key messages was the importance of having an effective mechanism for transitioning your Mode 1 outputs back into Mode 2 as seamlessly as possible. (The same is true in reverse for making enterprise data available as a Mode 1 input.)

One of the great things about OAC is how it serves to simplify this transition. Users are able to create analytic content based on data sourced from a broad range of locations: at the simplest level, Data Sets can be built from flat files or via one of the available Data Connectors to relational, NoSQL, proprietary database or Essbase sources. Moreover, enterprise curated metadata (via RPD lift-and-shift from an on-premise implementation) or analyst developed Subject Areas can be exposed. These sources can be ‘mashed’ together directly in a DV project or, for more complex or repeatable actions, Data Flows can be created to build Data Sets. Data Flows are pretty powerful, not only allowing users to join disparate data but also perform some useful data preparation activities, ranging from basic filtering, aggregation and data manipulation actions to more complex sentiment analysis, forecasting and even some machine learning modelling features. Importantly, Data Flows can be set to output their results to disk, either written to a Data Set or even to a database table and they can be scheduled for repetitive refresh.

For me, one of the most important things about the Data Flows feature is that it provides a clear and understandable interface which shows the sequencing of each of the data preparation stages, providing valuable information for any subsequent reverse engineering of the processing back into the enterprise data architecture.


In summary, there are plenty of exciting and innovative things happening with Oracle Analytics in the cloud and as time marches on, the case for moving to the cloud in one shape or form will probably get more and more compelling. However, beyond a strategic decision to ‘Go Cloud’, there are many options and complexities that need to be addressed in order to make a successful start to your journey - some technical, some procedural and some organisational. Whilst a level of planning and research will undoubtedly smooth the path, the great thing about the cloud services is that they are comparatively cheap and easy to initiate, so getting on and building a prototype is always going to be a good, exploratory starting point.

Categories: BI & Warehousing

Why DevOps Matters for Enterprise BI

Tue, 2018-06-12 09:44
Why DevOps Matters for Enterprise BI

Why are people frustrated with their existing enterprise BI tools such as OBIEE? My view is because it costs too much to produce relevant content. I think some of this is down to the tools themselves, and some of it is down to process.

Starting with the tools, they are not “bad” tools; the traditional licensing model can be expensive in today’s market, and traditional development methods are time-consuming and hence expensive. The vendor’s response is to move to the cloud and to highlight cost savings that can be made by having a managed platform. Oracle Analytics Cloud (OAC) is essentially OBIEE installed on Oracle’s servers in Oracle’s data centres with Oracle providing your system administration, coupled with the ability to flex your licensing on a monthly or annual basis.

Cloud does give organisations the potential for more agility. Provisioning servers can no longer hold up the start of a project, and if a system needs to increase capacity, then more CPUs or nodes can be added. This latter case is a bit murky due to the cost implications and the option to try and resolve performance issues through query efficiency on the database.

I don’t think this solves the problem. Tools that provide reports and dashboards are becoming more commoditised; up-and-coming vendors and platform providers are offering the service for a fraction of the cost of the traditional vendors. They may lack some of the enterprise features like open security models; however, these are an area that platform providers are continually improving. Over the last 10 years, Oracle's focus for OBIEE has been more on integration than innovation. Oracle DV was a significant change; however, there is a danger that Oracle lost the first-mover advantage to tools such as Tableau and QlikView. Additionally, some critical features like lineage, software lifecycle development, versioning and process automation are not built into OBIEE and, worse still, the legacy design and architecture of the product often hinders these.

So this brings me back round to process. Defining “good” processes and having tools to support them is one of the best ways you can keep your BI tools relevant to the business by reducing the friction in generating content.

What is a “good” process? Put simply, a process that reduces the time between the identification of a business need and realising it, with zero impact on existing components of the system. Also, a “good” process should provide visibility of any design, development and testing, plus documentation of changes, typically including lineage in a modern BI system. Continuous integration is the Holy Grail.

This why DevOps matters. Using automated migration across environments, regression tests, automatically generated documentation in the form of lineage, native support for version control systems, supported merge processes and ideally a scripting interface or API to automate the generation of repetitive tasks such as changing the data type of a group of fields system-wide, can dramatically reduce the gap from idea to realisation.

So, I would recommend that when looking at your enterprise BI system, you not only consider the vendor, location and features but also focus on the potential for process optimisation and automation. Automation could be something that the vendor builds into the tool, or you may need to use accelerators or software provided by a third party. Over the next few weeks, we will be publishing some examples and case studies of how our BI and DI Developer Toolkits have helped clients and enabled them to automate some or all of the BI software development cycle, reducing the time to release new features and increasing the confidence and robustness of the system.

Categories: BI & Warehousing

Real-time Sailing Yacht Performance - stepping back a bit (Part 1.1)

Mon, 2018-06-11 07:20

Slight change to the planned article. At the end of my analysis in Part 1 I discovered I was missing a number of key messages. It turns out that not all the SeaTalk messages from the integrated instruments were being translated to an NMEA format and therefore not being sent wirelessly from the AIS hub. I didn't really want to introduce another source of data directly from the instruments, as it would involve hard-wiring the instruments to the laptop and then translating a different message format (SeaTalk). I decided to spend some money on hardware instead (any excuse for new toys). I purchased a SeaTalk to NMEA converter from DigitalYachts (discounted at the London boat show, I'm glad to say).

This article is about the installation of that hardware and the results (hence Part 1.1), not our usual type of blog. You never know, it may be of interest to somebody out there, and this is a real-life data issue! Don't worry, it will be short and more of an insight into yacht wiring than anything.

The next blog will be very much back on track, looking at Kafka in the architecture.

The existing wiring

The following image shows the existing setup, what's behind the panels and how it links to the instrument architecture documented in Part 1. No laughing at the wiring spaghetti - I stripped out half a tonne of cable last year so this is an improvement. Most of the technology lives near the chart table and we have access to the navigation lights, cabin lighting, battery sensors and DSC VHF. The top left image also shows a spare GPS (Garmin) and far left an EPIRB.


I wanted to make sure I wasn't breaking anything by adding the new hardware, so I followed the approach we use as software engineers: check before, during and after any changes, enabling us to narrow down the point at which errors are introduced. To help with this I created a little bit of Python that reads the messages and tells me the unique message types, the total number of messages and the number of messages in error.

import json
import sys

#Function to test message validity
def is_message_valid (orig_line):

........ [Function same code described in Part 1]

#main body
f = open("/Development/step_1.log", "r")

valid_messages = 0
invalid_messages = 0
total_messages = 0
my_list = []

#process file main body
for line in f:

  orig_line = line

  if is_message_valid(orig_line):
    valid_messages = valid_messages + 1

    #collect the message type prefix (e.g. $IIMWV) so we can
    #report the unique message types seen
    if orig_line[0:1] == "$":
      my_list.append(orig_line[0:6])
  else:
    invalid_messages = invalid_messages + 1

  total_messages = total_messages + 1

#de-duplicate the collected message types and print them
new_list = list(set(my_list))

for message_type in new_list:
  print message_type

#High tech report
print "Summary"
print "#######"
print "valid messages -> ", valid_messages
print "invalid messages -> ", invalid_messages
print "total messages -> ", total_messages


For each of the steps, I used nc to write the output to a log file and then used the Python to analyse the log. I logged about ten minutes of messages at each step, although I have to confess to shortening the last test as I was getting very cold.

nc -l 2000 > step_x.log

While spooling the messages I artificially generated some speed data by spinning the wheel of the speedo. The image below shows the speed sensor and where it normally lives (far right image). The water comes in when you take out the sensor, as it temporarily leaves a rather large hole in the bottom of the boat - don't be alarmed by the little puddle you can see.

Step 1:

I spool and analyse about ten minutes of data without making any changes to the existing setup.

The existing setup takes data directly from the back of a Raymarine instrument seen below and gets linked into the AIS hub.

$AITXT -> AIS (from AIS hub)

$GPRMC -> GPS (from AIS hub)

$IIDBT -> Depth sensor
$IIMTW -> Sea temperature sensor
$IIMWV -> Wind speed 

valid messages ->  2129
invalid messages ->  298
total messages ->  2427
12% error

Step 2:

I disconnect the NMEA interface between the AIS hub and the integrated instruments. So in the diagram above I disconnect all four NMEA wires from the back of the instrument.

I observe the Navigation display of the integrated instruments no longer displays any GPS information (this is expected as the only GPS messages I have are coming from the AIS hub).


$AITXT -> AIS (from AIS hub)

$GPRMC -> GPS (from AIS hub)

No $II messages as expected 

valid messages ->  3639
invalid messages ->  232
total messages ->  3871
6% error

Step 3:

I wire in the new hardware both NMEA in and out then directly into the course computer.


$AITXT -> AIS (from AIS hub)

$GPGBS -> GPS messages

$IIMTW -> Sea temperature sensor
$IIMWV -> Wind speed 
$IIVHW -> Heading & Speed
$IIRSA -> Rudder Angle
$IIHDG -> Heading
$IIVLW -> Distance travelled

valid messages ->  1661
invalid messages ->  121
total messages ->  1782
6.7% error

I get all the messages I am after (for now), and the hardware seems to be working.

Now to put all the panels back in place!

In the next article, I will get back to technology and the use of Kafka in the architecture.

Categories: BI & Warehousing

Rittman Mead at Kscope 2018

Thu, 2018-05-31 02:20
Rittman Mead at Kscope 2018

Kscope 2018 is just a week away! Magnificent location (Walt Disney World Swan and Dolphin Resort) for one of the best tech conferences of the year! The agenda is impressive (look here) spanning over ten different tracks from the traditional EPM, BI Analytics and Data Visualization, to the newly added Blockchain! Plenty of great content and networking opportunities!

I'll be representing Rittman Mead with two talks: one about Visualizing Streams (Wednesday at 10:15 Northern Hemisphere A2, Fifth Level) on how to build a modern analytical platform including Apache Kafka, Confluent's KSQL, Apache Drill and Oracle's Data Visualization (Cloud or Desktop).

Rittman Mead at Kscope 2018

During the second talk, titled DevOps and OBIEE:
Do it Before it's Too Late!
(Monday at 10:45 Northern Hemisphere A1, Fifth Level), I'll be sharing details, based on our experience, on how OBIEE can be fully included in a DevOps framework, what the cost of "avoiding" DevOps and automation in general is, and how Rittman Mead's toolkits, partially described here, can be used to accelerate the adoption of DevOps practices in any situation.

Rittman Mead at Kscope 2018

If you’re at the event and you see me in sessions, around the conference or during my talks, I’d be pleased to speak with you about your projects and answer any questions you might have.

Categories: BI & Warehousing

Rittman Mead at OUG Norway 2018

Mon, 2018-03-05 04:45
Rittman Mead at OUG Norway 2018

This week I am very pleased to represent Rittman Mead at the Oracle User Group Norway Spring Seminar 2018, delivering two sessions about Oracle Analytics, Kafka, Apache Drill and Data Visualization, both on-premises and cloud. The OUGN conference is unique due to both the really high level of presentations (see the related agenda) and the fascinating location: the Color Fantasy cruiseferry, sailing from Oslo to Kiel and back.

Rittman Mead at OUG Norway 2018

I'll be speaking on Friday 9th at 9:30AM in Auditorium 2 about Visualizing Streams on how the world of Business Analytics has changed in recent years and how to successfully build a Modern Analytical Platform including Apache Kafka, Confluent's recently announced KSQL and Oracle's Data Visualization.

Rittman Mead at OUG Norway 2018

On the same day at 5PM, again in Auditorium 2, I'll be delivering the session OBIEE: Going Down the Rabbit Hole, providing details, built on experience, of how diagnostic tools, non-standard configuration and well-defined processes can enhance, secure and accelerate any analytical project.

If you’re at the event and you see me in sessions, around the conference or during my talks, I’d be pleased to speak with you about your projects and answer any questions you might have.

Categories: BI & Warehousing

Spring into action with our new OBIEE 12c Systems Management & Security On Demand Training course

Mon, 2018-02-19 05:49

Rittman Mead are happy to release a new course to the On Demand Training platform.

The OBIEE 12c Systems Management & Security course is the essential learning tool for any developers or administrators who will be working on the maintenance & optimisation of their OBIEE platform.

Baseline Validation Tool

View lessons and live demos from our experts on the following subjects:

  • What's new in OBIEE 12c
  • Starting & Stopping Services
  • Managing Metadata
  • System Preferences
  • Troubleshooting Issues
  • Caching
  • Usage Tracking
  • Baseline Validation Tool
  • Direct Database Request
  • Write Back
  • LDAP Users & Groups
  • Application Roles
  • Permissions

Get hands-on with the practical version of the course, which comes with an OBIEE 12c training environment and 9 lab exercises.
System Preferences

Rittman Mead will also be releasing a theory version of the course. This will not include the lab exercises, but includes all of the lessons and demos that you'd get as part of the practical course.

Course prices are as follows:

OBIEE 12c Systems Management & Security - PRACTICAL - $499

  • 30 days access to lessons & demos
  • 30 days access to OBIEE 12c training environment for lab exercises
  • 30 days access to Rittman Mead knowledge base for Q&A and lab support

OBIEE 12c Systems Management & Security - THEORY - $299

  • 30 days access to lessons & demos
  • 30 days access to Rittman Mead knowledge base for Q&A

To celebrate the changing of seasons we suggest you Spring into action with OBIEE 12c by receiving a 25% discount on both courses until 31st March 2018 using voucher code:


Access both courses and the rest of our catalog at learn.rittmanmead.com

Categories: BI & Warehousing

Confluent Partnership

Mon, 2018-02-12 09:14


Here at Rittman Mead, we are continually broadening the scope and expertise of our services to help our customers keep pace with today's ever-changing technology landscape. One significant change we have seen over the last few years is the increased adoption of data streaming. These solutions can help solve a variety of problems, from real-time data analytics to forming the entire backbone of an organisation's data architecture. We have worked with a number of different technologies that can enable this, however, we often see that Kafka ticks the most boxes.

This is reflected by some of the recent blog posts you will have seen, like Tom Underhill hooking up his gaming console to Kafka and Paul Shilling’s piece on collating sailing data. Both these posts use day-to-day, real-world examples to demonstrate some of the concepts behind Kafka.

In conjunction with these, we have been involved in more serious proofs of concept and projects with clients involving Kafka, which no doubt we will write about in time. To help us further our knowledge, and also immerse ourselves in the developer community, we have decided to become Confluent partners. Confluent was founded by the people who initially developed Kafka at LinkedIn, and provides a managed and supported version of Kafka through their platform.

We chose Confluent as we saw them as the driving force behind Kafka, plus the additions they are making to the platform such as the streaming API and KSQL are opening a lot of doors for how streamed data can be used.

We look forward to growing our knowledge and experience in this area and the possibilities that working with both Kafka and Confluent will bring us.

Categories: BI & Warehousing

Real-time Sailing Yacht Performance - Getting Started (Part 1)

Fri, 2018-01-19 03:54

In this series of articles, I intend to look at collecting and analysing our yacht’s data. I aim to show how a number of technologies can be used to achieve this, and the thought processes around the build and exploration of the data. Ultimately, I want to improve our sailing performance with data. That's not a new concept for professional teams, but I have a limited amount of hardware and funds (unlike Oracle, it seems), so it's time for a bit of DIY!

In this article, I introduce some concepts and terms then I'll start cleaning and exploring the data.


I have owned a Sigma 400 sailing yacht for over twelve years and she is used primarily for cruising but every now and then we do a bit of offshore racing.

In the last few years we have moved from paper charts and a very much manual way of life to electronic charts and iOS apps for navigation.

In 2017 we started to use weather modelling software to predict the optimal route for a passage, taking wind, tide and estimated boat performance (polars) into consideration.

The predicted routes are driven in part by a boat's polars. The original polars are a set of theoretical calculations created by the boat’s designer, defining what the boat should do at each wind speed and angle of sailing. Polars give us a plot of the boat's speed for a given true wind speed and angle. This in turn tells us the optimal speed the boat could achieve at any particular angle to the wind and wind speed (not taking into consideration helming accuracy, sea state, condition of sails and sail trim; it may be possible for me to derive different polars for different weather conditions). Fundamentally, polars also give us an indication of the optimal angle to the wind to get to our destination (velocity made good).
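To make the velocity made good idea concrete, here's a minimal sketch (the function name and signature are mine, not from any sailing library): VMG toward the wind is boat speed multiplied by the cosine of the angle to the wind.

```python
import math

def vmg(boat_speed_kts, angle_to_wind_deg):
    # velocity made good toward the wind, in knots
    return boat_speed_kts * math.cos(math.radians(angle_to_wind_deg))
```

Sailing at 6 knots at 45 degrees to the wind makes roughly 4.24 knots of progress upwind; at 90 degrees to the wind, none at all.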

The polars we use at the moment are based on a similar boat to the Sigma 400, but are really a best guess. I want our polars to be more accurate. I would also like to start tracking the boat's performance in real time and post-passage for further analysis.

The purpose of this blog is to use our boat's instrument data to create accurate polars for a number of conditions and get a better understanding of our boat's performance at each point of sail. I would also like to see what can be achieved with the AIS data. I intend to use Python to check and decode the data, and I will look at a number of tools to store, buffer, visualise and analyse the outputs.

So let’s look at the technology on-board.

Instrumentation Architecture

The instruments are by Raymarine. We have a wind vane, GPS, speed sensor, depth sounder and sea temperature gauge, electronic compass, gyroscope, and rudder angle reader. These are all fed into a central course computer. Some of the instrument displays share and enrich the data calculating such things as apparent wind angles as an example. All the data travels through a proprietary Raymarine messaging system called SeaTalk. To allow Raymarine instruments to interact with other instrumentation there is an NMEA-0183 port. NMEA-0183 is a well-known communication protocol and is fairly well documented so this is the data I need to extract from the system. I currently have an NMEA-0183 cable connecting the Raymarine instruments to an AIS transponder. The AIS transponder includes a Wireless router. The wireless router enables me to connect portable devices to the instrumentation.

The first task is to start looking at the data and get an understanding of what needs to be done before I can start analysing.

Analysing the data

There is a USB connection from the AIS hub; however, the instructions do warn that this should only be used during installation. I did spool some data from the USB port and it seemed to work OK. I could connect directly to the NMEA-0183 output, but that would require me to do some wiring, so I will look at that if the reliability of the wireless causes issues. The second option was to use the wireless connection. I start by spooling the data to a log file using nc (netcat, a TCP and UDP tool).

Spooling the data to a log file

nc  -p 1234 2000 > instrument.log

The spooled data gave me a clear indication that there would need to be some sanity checking of the data before it would be useful. The data is split into a number of different message types each consisting of a different structure. I will convert these messages into a JSON format so that the messages are more readable downstream. In the example below the timestamps displayed are attached using awk but my Python script will handle any enrichment as I build out.

The data is comma separated, which makes things easy, and there are a number of good websites that describe the contents of the messages. Looking at the data using a series of awk commands, I clearly identify three main types of messages: GPS, AIS and integrated instrument messages. Each message ends in a two-digit hex checksum, computed by XOR'ing the message's characters, which can be used to validate it.

Looking at an example wind message

We get two messages related to the wind, true and apparent; the data is the same because the boat was stationary.


These are Integrated Instrument Mast Wind Vane (IIMWV) messages *(I have made an assumption about the meaning of the M, so if you are an expert in these messages, feel free to correct me ;-))*

These messages break down to:

  1. $IIMWV II Talker, MWV Sentence
  2. 180.0 Wind Angle 0 - 359
  3. R Relative (T = True)
  4. 3.7 Wind Speed
  5. N Wind Speed Units: Knots (K = KPH, M = MPS)
  6. A Status (A= Valid)
  7. *30 Checksum

And in English (ish)

180.0 degrees relative, wind speed 3.7 knots.
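As a quick sanity check of the checksum scheme: the XOR of every character between the '$' and the '*' should match the two hex digits at the end of the sentence. A one-off sketch (the function name is mine):

```python
from functools import reduce

def nmea_checksum(sentence):
    # XOR all characters between the leading '$' and the '*'
    payload = sentence[1:sentence.index("*")]
    return format(reduce(lambda acc, ch: acc ^ ord(ch), payload, 0), "02X")
```

For the wind sentence above, `nmea_checksum("$IIMWV,180.0,R,3.7,N,A*30")` returns `"30"`, matching the transmitted checksum.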

Example corrupted message


Looks like the message failed to get a new line. I notice a number of other types of incomplete or corrupted messages so checking them will be an essential part of the build.

Creating a message reader

I don't really want to sit on the boat building code; I need to be doing this while travelling and at home when I get time. So, spooling half an hour of data to a log file gets me started. I can use Python to read from the file and, once up and running, spool the log file to a local TCP/IP port and read it using the Python socket library.

Firstly, I read the log file and loop through the messages, checking each message for validity using the checksum and line length. I used this to log the number of messages in error, etc. I have posted the test function; I'm sure there are better ways to write the code, but it works.

# Function to test whether an NMEA message is valid
def is_message_valid(orig_line):

  # check if the checksum is valid
  # set variables
  x = 1
  check = 0
  received_checksum = ""
  line_length = len(orig_line.strip())

  while x < line_length:
    current_char = orig_line[x]

    # checksum is always the two chars after the *
    if current_char == "*":
      received_checksum = orig_line[x + 1] + orig_line[x + 2]
      # no need to continue to the end of the
      # line; we either have the checksum or an error
      break

    # XOR every character between the '$' and the '*'
    check = check ^ ord(current_char)
    x = x + 1

  if format(check, "02X") == received_checksum:
    _Valid = 1
  else:
    _Valid = 0

  return _Valid

Now for the translation of messages. There are a number of example Python packages on GitHub that translate NMEA messages, but I am currently only interested in specific messages, and I also want to build appropriate JSON, so I feel I am better off writing this from scratch. Python has JSON libraries, so this is fairly straightforward once the message is defined. I start by looking at the wind and depth messages. I'm not currently seeing any speed messages, hopefully because the boat wasn't moving.

import json

def convert_iimwv_json(orig_line):
  # iimwv wind instrumentation

  column_list = orig_line.split(",")

  # the star separates the checksum from the status
  status_check_sum = column_list[5].split("*")
  checksum_value = status_check_sum[1]

  json_str = {'message_type': column_list[0],
              'wind_angle': column_list[1],
              'relative': column_list[2],
              'wind_speed': column_list[3],
              'wind_speed_units': column_list[4],
              'status': status_check_sum[0],
              'checksum': checksum_value[0:2]}

  # round-trip through the json library to confirm it serialises cleanly
  json_dmp = json.dumps(json_str)
  json_obj = json.loads(json_dmp)

  return json_obj

I now have a way of checking, reading and converting the messages to JSON from a log file. Switching from reading a file to using the Python socket library, I can read the stream directly from a TCP/IP port. Using nc it's possible to simulate the messages being sent from the instruments by piping the log file to a port.

Opening port 1234 and listening for terminal input

nc -l 1234
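Reading that port from Python might look something like the sketch below, assuming line-delimited NMEA text on a plain TCP socket (the host, port and function name are illustrative, not from the original build):

```python
import socket

def read_nmea_stream(host, port):
    # connect to the TCP port, buffer bytes, and yield complete lines
    with socket.create_connection((host, port)) as sock:
        buf = b""
        while True:
            data = sock.recv(4096)
            if not data:          # remote end closed the connection
                break
            buf += data
            while b"\n" in buf:
                line, buf = buf.split(b"\n", 1)
                yield line.decode("ascii", errors="replace").rstrip("\r")
```

Each yielded line can then be passed through the same validation and conversion functions used for the log file.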

Having spoken to some experts from Digital Yachts, it may be that the missing messages are because Raymarine SeaTalk is not transmitting an NMEA message for speed and a number of other readings. The way I have wired the NMEA inputs and outputs to the AIS hub may also be causing the doubling up of messages and the apparent corruption. I need more kit: a bi-directional SeaTalk to NMEA converter.

In the next article, I discuss the use of Kafka in the architecture. I want to buffer all my incoming raw messages; if I store everything that comes in, I can build out the analytics over time, i.e. as I decode each message type. I will also set about creating a near real-time dashboard to display the incoming metrics. The use of Kafka will give me scalability in the model. I'm particularly thinking of the Round the Island Race: 1,800 boats, a good number of which will be transmitting AIS data.

Categories: BI & Warehousing

Rittman Mead at UKOUG 2017

Mon, 2017-12-04 02:58

For those of you attending the UKOUG this year, we are giving three presentations on OBIEE and Data Visualisation.

Francesco Tisiot has two on Monday:

  • 14.25 // Enabling Self-Service Analytics With Analytic Views & Data Visualization From Cloud to Desktop - Hall 7a
  • 17:55 // OBIEE: Going Down the Rabbit Hole - Hall 7a

Federico Venturin is giving his culinary advice on Wednesday:

  • 11:25 // Visualising Data Like a Top Chef - Hall 6a

And Mike Vickers is diving into BI Publisher, also on Wednesday

  • 15:15 // BI Publisher: Teaching Old Dogs Some New Tricks - Hall 6a

In addition, Sam Jeremiah and I are also around, so if anyone wants to catch up, grab us for a coffee or a beer.

Categories: BI & Warehousing

Taking KSQL for a Spin Using Real-time Device Data

Tue, 2017-11-07 06:41
Taking KSQL for a Spin Using Real-time Device Data

Evaluating KSQL has been high on my to-do list ever since it was released back in August. I wanted to experiment with it using an interesting, high velocity, real-time data stream that would allow me to analyse events at the millisecond level, rather than seconds or minutes. Finding such a data source, that is free of charge and not the de facto twitter stream, is tricky. So, after some pondering, I decided that I'd use my Thrustmaster T300RS Steering Wheel/Pedal Set gaming device as a data source.

Taking KSQL for a Spin Using Real-time Device Data

The idea being that the data would be fed into Kafka, processed in real-time using KSQL and visualised in Grafana.

This is the end to end pipeline that I created...

Taking KSQL for a Spin Using Real-time Device Data

...and this is the resulting real-time dashboard running alongside a driving game and a log of the messages being sent by the device.

This article will explain how the above real-time dashboard was built using only KSQL...and a custom Kafka producer.

I'd like to point out that, although the device I'm using for testing is unconventional, when considered in the wider context of IoT, autonomous driving, smart automotives or any device for that matter, it is clear that the low latency and high throughput of Apache Kafka, coupled with Confluent's KSQL, can be a powerful combination.

I'd also like to point out that this article is not about driving techniques, driving games or telemetry analysis. However, seeing as the data source I'm using is intrinsically tied to those subjects, the concepts will be discussed to add context. I hope you like motorsports!

Writing a Kafka Producer for a T300RS

The T300RS is attached to my Windows PC via a USB cable, so the first challenge was to try and figure out how I could get steering, braking and accelerator inputs pushed to Kafka. Unsurprisingly, a source connector for a "T300RS Steering Wheel and Pedal Set" was not listed on the Kafka Connect web page - a custom producer was the only option.

To access the data being generated by the T300RS, I had 2 options: I could either use an existing telemetry API from one of my racing games, or I could access it directly using the Windows DirectX API. I didn't want to have a game running in the background in order to generate data, so I decided to go down the DirectX route. This way, the data is raw and available, with or without an actual game engine running.

The producer was written using the SharpDX .NET wrapper and Confluent's .NET Kafka Client. The SharpDX directinput API allows you to poll an attached input device (mouse, keyboard, game controllers etc.) and read its buffered data. The buffered data returned within each polling loop is serialized into JSON and sent to Kafka using the .NET Kafka Client library.

A single message is sent to a topic in Kafka called raw_axis_inputs every time the state of one of the device's axes changes. The device has several axes; in this article I am only interested in the Wheel, Accelerator, Brake and the X button.

    "event_id":4300415,         // Event ID unique over all axis state changes
    "timestamp":1508607521324,  // The time of the event
    "axis":"Y",                 // The axis this event belongs to
    "value":32873.0             // the current value of the axis

This is what a single message looks like. In the above message the Brake axis state was changed, i.e. it moved to a new position with value 32873.
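The producer itself is written in .NET, but the message shape is easy to pin down. A purely illustrative Python equivalent of building one such message (the function name, helper and starting event ID are mine, not the actual producer code) would be:

```python
import itertools
import json
import time

_event_ids = itertools.count(4300415)  # arbitrary starting id, for illustration only

def axis_event(axis, value):
    # build one raw_axis_inputs message in the same JSON shape as above
    return json.dumps({
        "event_id": next(_event_ids),
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
        "axis": axis,
        "value": float(value),
    })
```

Each such string would then be sent to the raw_axis_inputs topic by whatever Kafka client is in use.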

You can see below which inputs map to each reported axis from the device.

Taking KSQL for a Spin Using Real-time Device Data

Here is a sample from the producer's log file.


As you can tell by looking at the timestamps, it's possible to have multiple events generated within the same millisecond; unfortunately, I was unable to get microsecond precision from the device. When the "X", "Y" and "RotationZ" axes are being moved quickly at the same time (a bit like a child driving one of those coin-operated car rides you find at the seaside), the device generates approximately 500 events per second.

Creating a Source Stream

Now that we have data streaming to Kafka from the device, it's time to fire up KSQL and start analysing it. The first thing we need to do is create a source stream. The saying "Every River Starts with a Single Drop" is quite fitting here, especially in the context of stream processing. The raw_axis_inputs topic is our "Single Drop", and we need to create a KSQL stream on top of it.

CREATE STREAM raw_axis_inputs ( \  
     event_id BIGINT, \
     timestamp BIGINT, \
     axis VARCHAR, \
     value DOUBLE ) \
 WITH (kafka_topic = 'raw_axis_inputs', value_format = 'JSON');

With the stream created, we can now query it. I'm using the default auto.offset.reset = latest as I have the luxury of being able to blip the accelerator whenever I want to generate new data, a satisfying feeling indeed.

ksql> SELECT * FROM raw_axis_inputs;  
1508693510267 | null | 4480290 | 1508693510263 | RotationZ | 65278.0  
1508693510269 | null | 4480291 | 1508693510263 | RotationZ | 64893.0  
1508693510271 | null | 4480292 | 1508693510263 | RotationZ | 63993.0  
1508693510273 | null | 4480293 | 1508693510263 | RotationZ | 63094.0  
1508693510275 | null | 4480294 | 1508693510279 | RotationZ | 61873.0  
1508693510277 | null | 4480295 | 1508693510279 | RotationZ | 60716.0  
1508693510279 | null | 4480296 | 1508693510279 | RotationZ | 60267.0  
Derived Streams

We now have our source stream created and can start creating some derived streams from it. The first derived stream we are going to create filters out one event: when the X button is pressed it emits a value of 128, and when it's released it emits a value of 0.

Taking KSQL for a Spin Using Real-time Device Data

To simplify this input, I'm filtering out the release event. We'll see what the X button is used for later in the article.

CREATE STREAM axis_inputs WITH (kafka_topic = 'axis_inputs') AS \  
SELECT  event_id, \  
        timestamp, \
        axis, \
        value \
FROM    raw_axis_inputs \  
WHERE   axis != 'Buttons5' OR value != 0.0;  

From this stream we are going to create 3 further streams: one for the brake, one for the accelerator and one for the wheel.

All 3 axes emit values in the range 0-65535. The wheel emits a value of 0 when rotated fully left, a value of 65535 when rotated fully right, and 32767 when dead centre. The wheel itself is configured to rotate 900 degrees lock-to-lock, so it would be nice to report its last state change in degrees, rather than as a raw integer. For this we can create a new stream that includes only messages where the axis = 'X', with the axis values translated into the range -450 to 450 degrees. With this new value translation, maximum rotation left now equates to 450 degrees, maximum rotation right equates to -450 degrees, and 0 is now dead centre.

CREATE STREAM steering_inputs WITH (kafka_topic = 'steering_inputs') AS \  
  SELECT  axis, \
          event_id, \
          timestamp, \
          (value / (65535.0 / 900.0) - 900 / 2) * -1 as value \
  FROM    axis_inputs \
  WHERE   axis = 'X';
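The translation can be desk-checked outside KSQL; a small Python mirror of the same expression (the function name is mine):

```python
def steering_degrees(raw):
    # mirrors the KSQL expression: (value / (65535.0 / 900.0) - 900 / 2) * -1
    return (raw / (65535.0 / 900.0) - 900 / 2) * -1
```

Full left (raw 0) maps to 450 degrees, full right (raw 65535) to -450, and dead centre (raw 32767) to roughly 0.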

If we now query our new stream and move the wheel slowly around dead centre, we get the following results:

ksql> select timestamp, value from steering_inputs;

1508711287451 | 0.6388888888889142  
1508711287451 | 0.4305555555555429  
1508711287451 | 0.36111111111108585  
1508711287451 | 0.13888888888891415  
1508711287451 | -0.0  
1508711287467 | -0.041666666666685614  
1508711287467 | -0.26388888888891415  
1508711287467 | -0.3333333333333144  
1508711287467 | -0.5277777777777715  
1508711287467 | -0.5972222222222285  

The same query while the wheel is rotated fully left

1508748345943 | 449.17601281757845  
1508748345943 | 449.3270771343557  
1508748345943 | 449.5330739299611  
1508748345943 | 449.67040512703136  
1508748345959 | 449.8214694438087  
1508748345959 | 449.95880064087896  
1508748345959 | 450.0  

And finally, rotated fully right.

1508748312803 | -449.3408102540627  
1508748312803 | -449.4369420920119  
1508748312818 | -449.67040512703136  
1508748312818 | -449.7390707255665  
1508748312818 | -449.9725337605859  
1508748312818 | -450.0  

Here's the data plotted in Grafana.

Taking KSQL for a Spin Using Real-time Device Data

We now need to create 2 more derived streams to handle the accelerator and the brake pedals. This time, we want to translate the values to the range 0-100. When a pedal is fully depressed it should report a value of 100 and when fully released, a value of 0.

CREATE STREAM accelerator_inputs WITH (kafka_topic = 'accelerator_inputs') AS \  
SELECT  axis, \  
        event_id, \
        timestamp, \
        100 - (value / (65535.0 / 100.0)) as value \
FROM    axis_inputs \  
WHERE   axis = 'RotationZ';  
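Again, a quick Python mirror of the pedal expression for desk-checking (the function name is mine; with this device the raw value falls as the pedal is pressed, hence the inversion):

```python
def pedal_percent(raw):
    # mirrors the KSQL expression: 100 - (value / (65535.0 / 100.0))
    return 100 - raw / (65535.0 / 100.0)
```

A raw reading of 0 (pedal fully depressed) maps to 100, and 65535 (fully released) maps to 0.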

Querying the accelerator_inputs stream while fully depressing the accelerator pedal displays the following. (I've omitted many records in the middle to keep it short)

ksql> SELECT timestamp, value FROM accelerator_inputs;  
1508749747115 | 0.0  
1508749747162 | 0.14198473282442592  
1508749747193 | 0.24122137404580712  
1508749747209 | 0.43664122137404604  
1508749747225 | 0.5343511450381726  
1508749747287 | 0.6335877862595396  
1508749747318 | 0.7312977099236662  
1508749747318 | 0.8290076335877927  
1508749747334 | 0.9267175572519051  
1508749747381 | 1.0259541984732863  
1508749753943 | 98.92519083969465  
1508749753959 | 99.02290076335878  
1508749753959 | 99.1206106870229  
1508749753959 | 99.21832061068702  
1508749753975 | 99.31603053435114  
1508749753975 | 99.41374045801527  
1508749753975 | 99.5114503816794  
1508749753990 | 99.60916030534351  
1508749753990 | 99.70687022900763  
1508749753990 | 99.80458015267176  
1508749754006 | 100.0

...and displayed in Grafana

Taking KSQL for a Spin Using Real-time Device Data

Finally, we create the brake stream, which has the same value translation as the accelerator stream, so I won't show the query results this time around.

CREATE STREAM brake_inputs WITH (kafka_topic = 'brake_inputs') AS \  
SELECT  axis, \  
        event_id, \
        timestamp, \
        100 - (value / (65535 / 100)) as value \
FROM    axis_inputs \  
WHERE   axis = 'Y';  

Braking inputs in Grafana.

Taking KSQL for a Spin Using Real-time Device Data

Smooth is Fast

It is a general rule of thumb in motorsports that "Smooth is Fast", the theory being that making the fewest steering, accelerator and braking inputs you can, while still keeping the car on the desired racing line, results in a faster lap time. We can use KSQL to count the number of inputs for each axis over a Hopping Window to try and capture overall smoothness. To do this, we create our first KSQL table.

CREATE TABLE axis_events_hopping_5s_1s \  
WITH (kafka_topic = 'axis_events_hopping_5s_1s') AS \  
SELECT  axis, \  
        COUNT(*) AS event_count \
FROM    axis_inputs \  
WINDOW  HOPPING (SIZE 5 SECONDS, ADVANCE BY 1 SECOND) \
GROUP BY axis;  

A KSQL table is basically a view over an existing stream or another table. When a table is created from a stream, it needs to contain an aggregate function and group by clause. It's these aggregates that make a table stateful, with the underpinning stream updating the table's current view in the background. If you create a table based on another table you do not need to specify an aggregate function or group by clause.

The table we created above specifies that data is aggregated over a Hopping Window. The size of the window is 5 seconds and it will advance or hop every 1 second. This means that at any one time, there will be 5 open windows, with new data being directed to each window based on the key and the record's timestamp.

You can see below, when we query the table, that we have 5 open windows per axis, with each window 1 second apart.

ksql> SELECT * FROM axis_events_hopping_5s_1s;  
1508758267000 | X : Window{start=1508758267000 end=-} | X | 56  
1508758268000 | X : Window{start=1508758268000 end=-} | X | 56  
1508758269000 | X : Window{start=1508758269000 end=-} | X | 56  
1508758270000 | X : Window{start=1508758270000 end=-} | X | 56  
1508758271000 | X : Window{start=1508758271000 end=-} | X | 43  
1508758267000 | Y : Window{start=1508758267000 end=-} | Y | 25  
1508758268000 | Y : Window{start=1508758268000 end=-} | Y | 25  
1508758269000 | Y : Window{start=1508758269000 end=-} | Y | 25  
1508758270000 | Y : Window{start=1508758270000 end=-} | Y | 32  
1508758271000 | Y : Window{start=1508758271000 end=-} | Y | 32  
1508758267000 | RotationZ : Window{start=1508758267000 end=-} | RotationZ | 67  
1508758268000 | RotationZ : Window{start=1508758268000 end=-} | RotationZ | 67  
1508758269000 | RotationZ : Window{start=1508758269000 end=-} | RotationZ | 67  
1508758270000 | RotationZ : Window{start=1508758270000 end=-} | RotationZ | 67  
1508758271000 | RotationZ : Window{start=1508758271000 end=-} | RotationZ | 39  
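The window bookkeeping can be sketched in Python: with a 5-second window hopping every 1 second, an event belongs to every window whose start time lies in the 5 seconds up to and including its timestamp (the function name is mine):

```python
def window_starts(ts_ms, size_ms=5000, advance_ms=1000):
    # start times of every hopping window [start, start + size) covering ts_ms
    last = (ts_ms // advance_ms) * advance_ms   # latest start <= ts_ms
    first = last - size_ms + advance_ms         # earliest start still covering ts_ms
    return [s for s in range(first, last + advance_ms, advance_ms) if s >= 0]
```

For example, `window_starts(1508758271000)` returns the five starts 1508758267000 through 1508758271000, matching the five open windows per axis in the query output above.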

This data is going to be pushed into InfluxDB and therefore needs a timestamp column. We can create a new table for this, that includes all columns from our current table, plus the rowtime.

CREATE TABLE axis_events_hopping_5s_1s_ts \  
WITH (kafka_topic = 'axis_events_hopping_5s_1s_ts') AS \  
SELECT  rowtime AS timestamp, * \  
FROM    axis_events_hopping_5s_1s;  

And now, when we query this table we can see we have all the columns we need.

ksql> select timestamp, axis, event_count from axis_events_hopping_5s_1s_ts;  
1508761027000 | RotationZ | 61  
1508761028000 | RotationZ | 61  
1508761029000 | RotationZ | 61  
1508761030000 | RotationZ | 61  
1508761031000 | RotationZ | 61  
1508761028000 | Y | 47  
1508761029000 | Y | 47  
1508761030000 | Y | 47  
1508761031000 | Y | 47  
1508761032000 | Y | 47  
1508761029000 | X | 106  
1508761030000 | X | 106  
1508761031000 | X | 106  
1508761032000 | X | 106  
1508761033000 | X | 106  

This is the resulting graph in Grafana, with each axis stacked on top of the others, giving a visual representation of the total number of events overall and per axis. The idea here being that if you can drive a lap with fewer overall inputs or events, then the lap time should be faster.

Taking KSQL for a Spin Using Real-time Device Data

Calculating Lap Times

To calculate lap times, I needed a way of capturing the time difference between 2 separate events in a stream. Remember that the raw data is coming directly from the device and has no concept of a lap; lap data is handled by a game engine.
I needed a way to inject an event into the stream when I crossed the start/finish line of any given race track. To achieve this, I modified the custom producer to increment a counter every time the X button was pressed and added a new field to the JSON message called lap_number.

Taking KSQL for a Spin Using Real-time Device Data

I then needed to recreate my source stream and my initial derived stream to include this new field

New source stream

CREATE STREAM raw_axis_inputs ( \  
     event_id BIGINT, \
     timestamp BIGINT, \
     lap_number BIGINT, \
     axis VARCHAR, \
     value DOUBLE ) \
 WITH (kafka_topic = 'raw_axis_inputs', value_format = 'JSON');

New derived stream.

CREATE STREAM axis_inputs WITH (kafka_topic = 'axis_inputs') AS \  
SELECT  event_id, \  
        timestamp, \
        lap_number, \
        axis, \
        value \
FROM    raw_axis_inputs \  
WHERE   axis != 'Buttons5' OR value != 0.0;  

Now, when I query the axis_inputs stream and press the X button a few times, we can see the lap number incrementing.

ksql> SELECT timestamp, lap_number, axis, value FROM axis_inputs;  
1508762511506 | 6 | X | 32906.0  
1508762511553 | 6 | X | 32907.0  
1508762511803 | 6 | X | 32909.0  
1508762512662 | 7 | Buttons5 | 128.0  
1508762513178 | 7 | X | 32911.0  
1508762513256 | 7 | X | 32913.0  
1508762513318 | 7 | X | 32914.0  
1508762513381 | 7 | X | 32916.0  
1508762513459 | 7 | X | 32918.0  
1508762513693 | 7 | X | 32919.0  
1508762514584 | 8 | Buttons5 | 128.0  
1508762515021 | 8 | X | 32921.0  
1508762515100 | 8 | X | 32923.0  
1508762515209 | 8 | X | 32925.0  
1508762515318 | 8 | X | 32926.0  
1508762515678 | 8 | X | 32928.0  
1508762516756 | 8 | X | 32926.0  
1508762517709 | 9 | Buttons5 | 128.0  
1508762517756 | 9 | X | 32925.0  
1508762520381 | 9 | X | 32923.0  
1508762520709 | 9 | X | 32921.0  
1508762520881 | 10 | Buttons5 | 128.0  
1508762521396 | 10 | X | 32919.0  
1508762521568 | 10 | X | 32918.0  
1508762521693 | 10 | X | 32916.0  
1508762521803 | 10 | X | 32914.0  

The next step is to calculate the time difference between each "Buttons5" event (the X button). This required a new table and a new stream. The table below captures the latest values, using the MAX() function, from the axis_inputs stream where axis = 'Buttons5'.

CREATE TABLE lap_marker_data WITH (kafka_topic = 'lap_marker_data') AS \  
SELECT  axis, \  
        MAX(event_id) AS lap_start_event_id, \
        MAX(timestamp) AS lap_start_timestamp, \ 
        MAX(lap_number) AS lap_number \
FROM    axis_inputs \  
WHERE   axis = 'Buttons5' \  
GROUP BY axis;  

When we query this table, a new row is displayed every time the X button is pressed, reflecting the latest values from the stream.

ksql> SELECT axis, lap_start_event_id, lap_start_timestamp, lap_number FROM lap_marker_data;  
Buttons5 | 4692691 | 1508763302396 | 15  
Buttons5 | 4693352 | 1508763306271 | 16  
Buttons5 | 4693819 | 1508763310037 | 17  
Buttons5 | 4693825 | 1508763313865 | 18  
Buttons5 | 4694397 | 1508763317209 | 19  

What we can now do is join this table to a new stream.

CREATE STREAM lap_stats WITH (kafka_topic = 'lap_stats') AS \  
SELECT  l.lap_number as lap_number, \  
        l.lap_start_event_id, \
        l.lap_start_timestamp, \
        a.timestamp AS lap_end_timestamp, \
        (a.event_id - l.lap_start_event_id) AS lap_events, \
        (a.timestamp - l.lap_start_timestamp) AS laptime_ms \
FROM       axis_inputs a LEFT JOIN lap_marker_data l ON a.axis = l.axis \  
WHERE   a.axis = 'Buttons5';    

Stream created

ksql> describe lap_stats;

 Field               | Type
 ROWTIME             | BIGINT
 ROWKEY              | VARCHAR (STRING)

This new stream is again based on the axis_inputs stream where the axis = 'Buttons5'. We are joining it to our lap_marker_data table which results in a stream where every row includes the current and previous values at the point in time when the X button was pressed.

A quick query should illustrate this (I've manually added column headings to make it easier to read).

ksql> SELECT lap_number, lap_start_event_id, lap_start_timestamp, lap_end_timestamp, lap_events, laptime_ms FROM lap_stats;

36 | 4708512 | 1508764549240 | 1508764553912 | 340   | 4672  
37 | 4708852 | 1508764553912 | 1508764567521 | 1262  | 13609  
38 | 4710114 | 1508764567521 | 1508764572162 | 1174  | 4641  
39 | 4711288 | 1508764572162 | 1508764577865 | 1459  | 5703  
40 | 4712747 | 1508764577865 | 1508764583725 | 939   | 5860  
41 | 4713686 | 1508764583725 | 1508764593475 | 2192  | 9750  
42 | 4715878 | 1508764593475 | 1508764602318 | 1928  | 8843
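The join logic can be sketched outside KSQL as a simple fold over the marker events; this hypothetical Python version keeps the latest marker (as the lap_marker_data table does) and emits a lap record on each new Buttons5 event:

```python
def lap_stats(events):
    """Given (event_id, timestamp, axis) tuples, yield
    (lap_events, laptime_ms) each time the 'Buttons5' marker fires,
    mirroring the stream-table join above."""
    marker = None  # latest (event_id, timestamp) seen for Buttons5
    for event_id, timestamp, axis in events:
        if axis != "Buttons5":
            continue
        if marker is not None:
            yield (event_id - marker[0], timestamp - marker[1])
        marker = (event_id, timestamp)

# Reproducing lap 36 from the output above:
laps = list(lap_stats([
    (4708512, 1508764549240, "Buttons5"),
    (4708600, 1508764550000, "X"),
    (4708852, 1508764553912, "Buttons5"),
]))
print(laps)
```

In KSQL the "latest marker" state is held in the lap_marker_data table rather than a local variable, but the arithmetic is identical.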

We can now see the time difference, in milliseconds ( laptime_ms ), between each press of the X button. This data can now be displayed in Grafana.

(Image: lap times displayed in Grafana)

The data is also being displayed along the top of the dashboard, aligned above the other graphs, as a ticker to help visualize lap boundaries across all axes.

(Image: lap-time ticker aligned across the Grafana dashboard)

Anomaly Detection

A common use case when performing real-time stream analytics is Anomaly Detection, the act of detecting unexpected events, or outliers, in a stream of incoming data. Let's see what we can do with KSQL in this regard.

Driving Like a Lunatic?

As mentioned previously, Smooth is Fast, so it would be nice to be able to detect some form of erratic driving. When a car oversteers, the rear end of the car starts to rotate around a corner faster than you'd like; to counteract this motion, quick steering inputs are required to correct it. On a smooth lap you will only need a small part of the total range of the steering wheel to safely navigate all corners; when you start oversteering you will need to make quick, but wider, use of the total range of the wheel to keep the car on the track and prevent crashing.

To try and detect oversteer we need to create another KSQL table, this time based on the steering_inputs stream. This table counts steering events across a very short hopping window. Events are counted only if the rotation exceeds 180 degrees (sharp left rotation) or is less than -180 degrees (sharp right rotation).

CREATE TABLE oversteer WITH (kafka_topic = 'oversteer') AS \  
SELECT  axis, \  
        COUNT(*) \
FROM    steering_inputs \  
WINDOW  HOPPING (SIZE 100 MILLISECONDS, ADVANCE BY 10 MILLISECONDS) \  
WHERE   value > 180 OR value < -180 \  
GROUP BY axis;  

We now create another table that includes the timestamp for InfluxDB.

CREATE TABLE oversteer_ts WITH (kafka_topic = 'oversteer_ts') AS \  
SELECT rowtime AS timestamp, * \  
FROM oversteer;  

If we query this table, while quickly rotating the wheel beyond 180 degrees in either direction, we can see multiple windows, 10ms apart, with a corresponding count of events.

ksql> SELECT * FROM oversteer_ts;  
1508767479920 | X : Window{start=1508767479920 end=-} | 1508767479920 | X | 5  
1508767479930 | X : Window{start=1508767479930 end=-} | 1508767479930 | X | 10  
1508767479940 | X : Window{start=1508767479940 end=-} | 1508767479940 | X | 15  
1508767479950 | X : Window{start=1508767479950 end=-} | 1508767479950 | X | 20  
1508767479960 | X : Window{start=1508767479960 end=-} | 1508767479960 | X | 25  
1508767479970 | X : Window{start=1508767479970 end=-} | 1508767479970 | X | 30  
1508767479980 | X : Window{start=1508767479980 end=-} | 1508767479980 | X | 35  
1508767479990 | X : Window{start=1508767479990 end=-} | 1508767479990 | X | 40  
1508767480000 | X : Window{start=1508767480000 end=-} | 1508767480000 | X | 45  
1508767480010 | X : Window{start=1508767480010 end=-} | 1508767480010 | X | 50  
1508767480020 | X : Window{start=1508767480020 end=-} | 1508767480020 | X | 50  
1508767480030 | X : Window{start=1508767480030 end=-} | 1508767480030 | X | 50  
1508767480040 | X : Window{start=1508767480040 end=-} | 1508767480040 | X | 50  
1508767480050 | X : Window{start=1508767480050 end=-} | 1508767480050 | X | 50  
1508767480060 | X : Window{start=1508767480060 end=-} | 1508767480060 | X | 47  
1508767480070 | X : Window{start=1508767480070 end=-} | 1508767480070 | X | 47  
1508767480080 | X : Window{start=1508767480080 end=-} | 1508767480080 | X | 47  
1508767480090 | X : Window{start=1508767480090 end=-} | 1508767480090 | X | 47  
1508767480100 | X : Window{start=1508767480100 end=-} | 1508767480100 | X | 47  

This data is plotted on the Y axis (we're talking graphs now) on the "Steering inputs" panel in Grafana. The oversteer metric can be seen in red and will spike when steering input exceeds 180 degrees in either direction.

(Image: oversteer metric spiking on the "Steering inputs" panel in Grafana)

Braking too Hard?

Another anomaly I'd like to detect is when maximum brake pressure is applied for too long. Much like the brake pedal in a real car, the brake pedal I'm using has a very progressive feel; a fair amount of force from your foot is required to hit maximum pressure. If you do hit maximum pressure, it shouldn't be for long, as you will most likely lock the wheels and skid off the race track; very embarrassing indeed.

The first thing to do is to create a table that will store the last time maximum brake pressure was applied. This table is based on the brake_inputs stream and filters for rows where value = 100.

CREATE TABLE max_brake_power_time \  
WITH (kafka_topic = 'max_brake_power_time') AS \  
SELECT  axis, \  
        MAX(timestamp) as last_max_brake_ts \
FROM    brake_inputs \  
WHERE     value = 100 \  
GROUP by axis;  

A query of this table displays a new row each time maximum brake pressure is hit.

ksql> SELECT axis, last_max_brake_ts FROM max_brake_power_time;  
 Y | 1508769263100
 Y | 1508769267881
 Y | 1508769271568

Something worth mentioning is that if I hold my foot on the brake pedal at the maximum pressure for any period of time, only one event is found in the stream. This is because the device only streams data when the state of an axis changes. If I keep my foot still, no new events will appear in the stream. I'll deal with this in a minute.
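The device's change-only behaviour can be mimicked with a simple filter; this is a sketch of the idea, not the device code:

```python
def changes_only(samples):
    """The controller behaves like this filter: a reading is emitted
    only when the axis value differs from the previous reading."""
    out, last = [], None
    for value in samples:
        if value != last:
            out.append(value)
            last = value
    return out

# Holding the brake flat at 100.0 produces a single event in the stream:
print(changes_only([99.0, 100.0, 100.0, 100.0, 97.6]))
```

This is exactly why the duration at maximum pressure has to be recovered from the first event *after* the 100.0 reading, as described below.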

Next we'll create a new stream based on the brake_inputs stream and join it to our max_brake_power_time table.

CREATE STREAM brake_inputs_with_max_brake_power_time \  
WITH ( kafka_topic = 'brake_inputs_with_max_brake_power_time') AS \  
SELECT  bi.value, \  
        bi.timestamp, \
        mb.last_max_brake_ts, \
        bi.timestamp - mb.last_max_brake_ts AS time_since_max_brake_released \
FROM    brake_inputs bi LEFT JOIN max_brake_power_time mb ON bi.axis = mb.axis;  

For each row in this stream we now have access to all columns in the brake_inputs stream, plus a timestamp telling us when max brake power was last reached. With this data we create a new derived column, bi.timestamp - mb.last_max_brake_ts AS time_since_max_brake_released, which gives a running calculation of the difference between the current record timestamp and the last time maximum brake pressure was applied.

For example, when we query the stream we can see that maximum pressure was applied at timestamp 1508772739115 with a value of 100.0. It's the row immediately after this one that we're interested in: 99.90234225 | 1508772740803 | 1508772739115 | 1688.

Again, I've manually added column headings to make it easier to read.

ksql> SELECT value, timestamp, last_max_brake_ts, time_since_max_brake_released FROM brake_inputs_with_max_brake_power_time;

98.53513389 | 1508772739100 | 1508772733146       | 5954  
98.82810711 | 1508772739100 | 1508772733146       | 5954  
99.02342259 | 1508772739115 | 1508772733146       | 5969  
99.51171129 | 1508772739115 | 1508772733146       | 5969  
99.70702677 | 1508772739115 | 1508772733146       | 5969  
100.0       | 1508772739115 | 1508772733146       | 5969  
99.90234225 | 1508772740803 | 1508772739115       | 1688  
99.51171129 | 1508772740818 | 1508772739115       | 1703  
99.12108033 | 1508772740818 | 1508772739115       | 1703  
97.65621423 | 1508772740818 | 1508772739115       | 1703  
96.58197909 | 1508772740818 | 1508772739115       | 1703  
95.41008621 | 1508772740818 | 1508772739115       | 1703  
94.43350881 | 1508772740818 | 1508772739115       | 1703  
93.65224689 | 1508772740818 | 1508772739115       | 1703  
93.35927367 | 1508772740818 | 1508772739115       | 1703  
92.87098496 | 1508772740834 | 1508772739115       | 1719  
92.38269626 | 1508772740834 | 1508772739115       | 1719  
91.11314564 | 1508772740834 | 1508772739115       | 1719  
90.62485694 | 1508772740834 | 1508772739115       | 1719  
90.42954146 | 1508772740834 | 1508772739115       | 1719  
89.35530632 | 1508772740834 | 1508772739115       | 1719  
87.89044022 | 1508772740834 | 1508772739115       | 1719  
87.40215152 | 1508772740850 | 1508772739115       | 1735  
86.52323186 | 1508772740850 | 1508772739115       | 1735  

Remember that while an axis is held at the same value, 100.0 in this case, no more events will appear in the stream until the value changes again. This is why we are interested in the row following the maximum value: it tells us how long the value of 100.0 was applied for. In this case it was held for 1688 milliseconds. Notice that on subsequent rows time_since_max_brake_released keeps increasing, but we are not interested in those rows. In order to isolate what we want, we need another table. This new table takes our previously created stream, brake_inputs_with_max_brake_power_time, and groups it by the last_max_brake_ts column. For each grouping we then get the MIN(time_since_max_brake_released).

CREATE TABLE hard_braking WITH ( kafka_topic = 'hard_braking') AS \  
SELECT  last_max_brake_ts, \  
        MIN(time_since_max_brake_released) AS time_spent_at_max_brake_ms \
FROM    brake_inputs_with_max_brake_power_time \  
GROUP BY last_max_brake_ts;  

When we query this table, while stepping hard on the brake pedal for a few seconds at a time, we get the information we want: the timestamp for when maximum brake pressure was reached, and for how long it was sustained.

ksql> SELECT last_max_brake_ts, time_spent_at_max_brake_ms FROM hard_braking;  
1508775178693 | 1360  
1508775178693 | 1360  
1508775183334 | 1000  
1508775183334 | 1000  
1508775187709 | 422  
1508775187709 | 422  
1508775187709 | 422  
1508775187709 | 422  
1508775187709 | 422  
1508775187709 | 422  
1508775187709 | 422  
1508775191256 | 1344  
1508775191256 | 1344  
1508775191256 | 1344  
1508775195850 | 1687  
1508775195850 | 1687  
1508775195850 | 1687  
1508775200662 | 1922  
1508775200662 | 1922  
1508775200662 | 1922  
1508775200662 | 1922  

Here's what the above data looks like when visualised in Grafana. The bottom graph shows when maximum brake pressure was hit and for how long it was sustained. I've set a threshold against the graph of 1 second so any extreme braking is clearly identifiable - if you're that hard on the brakes for that long, you're probably going to end up in the scenery.

(Image: hard-braking graph in Grafana with a 1-second threshold)

The Tale of 2 Laps

After putting it all together, it's time to take to the track and see how it looks. This video shows 2 complete laps onboard with the Caterham Seven 620R around Brands Hatch in the UK. The first lap is a relatively smooth one and the second is quite ragged. Notice that the first lap ( lap 68 ) is quicker overall than the second ( lap 69 ). On lap 69, I start to drive more aggressively and oversteer spikes start to appear in the steering input graph. Lap 69 also has significantly more events overall than lap 68 as a result of my more exuberant ( slower ) driving style. You'll also notice that maximum brake pressure is reached a couple of times on each lap, but for no longer than the threshold of 1 second on each occurrence.


KSQL is awesome! Although it's only a developer preview at this point, it's impressive what you can get done with it. As it evolves over time and mirrors more of the functionality of the underlying Streams API it will become even more powerful, lowering the barrier to entry for real-time stream processing further and further. Take a look at the road map to see what may be coming next.

Oh, and I recently discovered on the #KSQL community Slack group that you can execute KSQL in Embedded Mode right inside your Java code, allowing you to mix the native Streams API with KSQL - very nice indeed!

Categories: BI & Warehousing

KSQL: Streaming SQL for Apache Kafka

Wed, 2017-10-18 10:18

A few weeks back, while I was enjoying my holidays in the south of Italy, I started receiving notifications about an imminent announcement by Confluent. Reading the highlights almost (...I said almost) made me want to go immediately back to work and check all the details about it.
The announcement regarded KSQL: a streaming SQL engine for Apache Kafka!

My office today... not bad! #sea pic.twitter.com/A7skHIcplS

— Francesco Tisiot (@FTisiot) August 7, 2017

Before going into detail, let's try to clarify the basics: what is KSQL? Why was it introduced, and how does it complement Kafka?

What is KSQL?

We have been writing about Kafka several times, including my recent blogs where I was using it as a data hub to capture Game of Thrones tweets and store them in BigQuery in order to do sentiment analysis with Tableau. In all our examples Kafka has been used just for data transportation, with any necessary transformation happening in the target datastore like BigQuery, with the usage of languages like Python, engines like Spark Streaming, or directly in the querying tool like Presto.

KSQL enables something really effective: reading, writing and transforming data in real-time and at scale, using a language already known by the majority of the community working in the data space: SQL!


KSQL is now available as a developer preview, but the basic operations like joins, aggregations and event-time windowing are already covered.

What Problem is KSQL Solving?

As anticipated before, KSQL solves the main problem of providing a SQL interface over Kafka, without the need for external languages like Python or Java.
However, one could argue that the same problem was solved before by the ETL operations made on target datastores like Oracle Database or BigQuery. What is the difference, then, in KSQL's approach? What are the benefits?

The main difference, in my opinion, is the concept of continuous queries: with KSQL, transformations are done continuously as new data arrives in the Kafka topic. On the other hand, transformations done in a database (or big data platforms like BigQuery) are one-off, and if new data arrives the same transformation has to be executed again.


So what is KSQL good for? Confluent's KSQL introduction blog post provides some use cases like real-time analytics, security and anomaly detection, online data integration, and general application development. From a generic point of view, KSQL is what you should use when transformations, integrations and analytics need to happen on the fly during the data stream. KSQL provides a way of keeping Kafka as the unique data hub: no need to take data out, transform it and re-insert it in Kafka. Every transformation can be done in Kafka using SQL!

As mentioned before, KSQL is now available in developer preview and the feature/function list is somewhat limited compared to more mature SQL products. However, in cases where very complex transformations need to happen, those can still be solved either via another language like Java or a dedicated ETL (or view) once the data has landed in the destination datastore.

How does KSQL work?

So how does KSQL work under the hood? There are two concepts to keep in mind: streams and tables. A stream is a sequence of structured data; once an event is introduced into a stream it is immutable, meaning that it can't be updated or deleted. Imagine the number of items pushed to or pulled from storage: "e.g. 200 pieces of ProductA were stocked today, while 100 pieces of ProductB were taken out".
A table, on the other hand, represents the current situation based on the events coming from a stream. E.g. what's the overall quantity of stock for ProductA? Facts in a table are mutable: the quantity of ProductA can be updated, or deleted if ProductA is no longer in stock.
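The stock example above can be sketched in a few lines of plain Python (an illustration of the concept, not of KSQL's internals): the stream is an immutable list of events, and the "table" is just the latest per-key aggregate derived from it.

```python
from collections import defaultdict

# An immutable stream of stock-movement events: (product, quantity change)
stream = [("ProductA", +200), ("ProductB", +100), ("ProductA", -50)]

def table_from(stream):
    """A 'table' is the current aggregate state per key,
    continuously derivable from the event stream."""
    table = defaultdict(int)
    for product, delta in stream:
        table[product] += delta
    return dict(table)

print(table_from(stream))
```

Appending a new event to the stream and re-deriving the table is exactly what KSQL does continuously as messages arrive.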

(Image: streams vs. tables)

KSQL enables the definition of streams and tables via a simple SQL dialect. Various streams and tables coming from different sources can be joined directly in KSQL enabling data combination and transformation on the fly.

Each stream or table created in KSQL will be stored in a separate topic, allowing the usage of the usual connectors or scripts to extract the information from it.

KSQL in Action Starting KSQL

KSQL can work in both standalone and client-server mode, with the first aimed at development and testing scenarios and the second supporting production environments.
In standalone mode, the KSQL client and server are hosted on the same machine, in the same JVM. In client-server mode, a pool of KSQL servers runs on remote machines and the client connects to them over HTTP.

For my test purposes I decided to use the standalone mode. The procedure is well explained in the Confluent documentation and consists of three steps:

  • Clone the KSQL repository
  • Compile the code
  • Start KSQL using the local parameter
./bin/ksql-cli local
Analysing OOW Tweets

I'll use for my example the same Twitter producer created for my Wimbledon post. If you notice, I'm not using Kafka Connect; this is due to KSQL not supporting AVRO formats as of now (remember, it's still in dev phase). I then had to rely on the old producer, which stored the tweets in JSON format.

For my tests I've been filtering the tweets containing OOW17 and OOW (Oracle Open World 2017), and as mentioned before, those are coming in JSON format and stored in a Kafka topic named rm.oow. The first step is then to create a Stream on top of the topic in order to structure the data before doing any transformation.
The guidelines for the stream definition can be found here; the following is a cut-down version of the code used.

CREATE STREAM twitter_raw ( \  
  Created_At VARCHAR, \
  Id BIGINT, \
  Text VARCHAR, \
  Source VARCHAR, \
  Truncated VARCHAR, \
  User VARCHAR, \
  Retweet VARCHAR, \
  Contributors VARCHAR, \
  ...) \
WITH ( \  
  kafka_topic='rm.oow', \
  value_format='JSON');

A few things to notice:

  • Created_At VARCHAR: Created_At is a timestamp, however in the first stream definition I can't apply any date/timestamp conversion. I keep it as VARCHAR which is one of the allowed types (others are BOOLEAN, INTEGER, BIGINT, DOUBLE, VARCHAR, ARRAY<ArrayType> and MAP<VARCHAR, ValueType>).
  • User VARCHAR: the User field is a JSON nested structure, for the basic stream definition we'll leave it as VARCHAR with further transformations happening later on.
  • kafka_topic='rm.oow': source declaration
  • value_format='JSON': data format

Once the first stream is created, we can query it in SQL, like

select Created_at, text from twitter_raw  

with the output being in the form of a continuous flow: as soon as a new tweet arrives, it's visualized in the console.

(Image: continuous query output in the KSQL console)

The first part I want to fix now is the Created_At field, which was declared as VARCHAR but needs to be mutated into timestamp. I can do it using the function STRINGTOTIMESTAMP with the mask being EEE MMM dd HH:mm:ss ZZZZZ yyyy. This function converts the string to a BIGINT which is the datatype used by Kafka to store timestamps.
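As an aside, the same conversion can be sketched in Python; the %a/%b/%z directives play the role of the Java-style EEE/MMM/ZZZZZ tokens (the sample date below is made up for illustration):

```python
from datetime import datetime

def to_epoch_ms(created_at):
    """Parse a Twitter-style created_at string (e.g.
    'Mon Oct 09 10:30:00 +0000 2017') into epoch milliseconds,
    which is what STRINGTOTIMESTAMP produces on the KSQL side."""
    dt = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return int(dt.timestamp() * 1000)

print(to_epoch_ms("Mon Oct 09 10:30:00 +0000 2017"))
```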

Another section of the tweet that needs further parsing is the User, that as per the previous definition returns the whole nested JSON object.

"name":"Francesco Tisiot",
"location":"Verona, Italy","url":"http://it.linkedin.com/in/francescotisiot",

Fortunately, KSQL provides the EXTRACTJSONFIELD function that we can use to parse the JSON and retrieve the required fields.

I can now define a new twitter_fixed stream with the following code

create stream twitter_fixed as  
  select STRINGTOTIMESTAMP(Created_At, 'EEE MMM dd HH:mm:ss ZZZZZ yyyy') AS  Created_At, \
    Id, \
    Text, \
    Source, \
    ..., \
    EXTRACTJSONFIELD(User, '$.name') as User_name, \
    EXTRACTJSONFIELD(User, '$.screen_name') as User_screen_name, \
    EXTRACTJSONFIELD(User, '$.id') as User_id, \
    EXTRACTJSONFIELD(User, '$.location') as User_location, \
    EXTRACTJSONFIELD(User, '$.description') as description \
  from twitter_raw

An important thing to notice is that Created_At is now encoded as a BIGINT, thus if I execute select Created_At from twitter_fixed I get only the raw number. To translate it to a readable date I can use the TIMESTAMPTOSTRING function, passing the column and the date format.

The last part of the stream definition I wanted to fix is the setting of KEY and TIMESTAMP: a KEY is the unique identifier of a message and, if not declared, is auto-generated by Kafka. However, the tweet JSON contains the Id, which is Twitter's unique identifier, so we should use it. TIMESTAMP associates the message timestamp with a column in the stream: Created_At should be used. I can define the two above in the WITH clause of the stream declaration.

create stream twitter_with_key_and_timestamp \  
as \  
select * from twitter_fixed \  
with \  
(KEY='Id', TIMESTAMP='Created_At');

When doing a select * from twitter_with_key_and_timestamp we can clearly see that KSQL adds two columns before the others containing TIMESTAMP and KEY and the two are equal to Created_At and Id.

(Image: query output showing the KEY and TIMESTAMP columns)

Now I have all the fields correctly parsed in a KSQL stream. Nice, but in my previous blog post I had almost the same for free using Kafka Connect. Now it's time to discover the next step of KSQL: tables!

Let's first create a simple table containing the number of tweets by User_name.

create table tweets_by_users as \  
select user_screen_name, count(Id) nr_of_tweets \  
from twitter_with_key_and_timestamp \  
group by user_screen_name  

When executing a simple select * from the table, we can see the expected result.

(Image: tweets_by_users query output)

Two things to notice:

  • We see a new row in the console every time there is a new record inserted in the rm.oow topic; the new row contains the updated count of tweets for the selected screen_name
  • The KEY is automatically generated by KSQL and contains the screen_name

I can retrieve the list of tables defined with the show tables command.

(Image: output of the show tables command)

It's interesting to notice that the format is automatically set as JSON. The format property, configured via the VALUE_FORMAT parameter, defines how the message is stored in the topic and can either be JSON or DELIMITED.


When grouping, KSQL provides three different windowing functions:

  • Tumbling: Fixed size, non overlapping. The SIZE of the window needs to be specified.
  • Hopping: Fixed size, possibly overlapping. The SIZE and ADVANCE parameters need to be specified.
  • Session: Dynamically sized, non-overlapping. A window starts with the first entry for a particular key and remains open as long as subsequent messages with the same key arrive within the INACTIVITY_GAP, which is the parameter to be specified.
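Tumbling and hopping window assignment can be pictured with a few lines of Python (a sketch with arbitrary time units; session windows depend on per-key inactivity, so they're omitted):

```python
def hopping_windows(ts, size, advance):
    """Return the start times of every hopping window that contains ts.
    A tumbling window is the special case where advance == size."""
    first = (ts // advance) * advance - size + advance
    return [s for s in range(max(first, 0), ts + 1, advance) if s <= ts < s + size]

# Hopping, SIZE 30 / ADVANCE 10: an event falls into three overlapping windows
print(hopping_windows(45, size=30, advance=10))
# Tumbling, SIZE 30: exactly one window
print(hopping_windows(45, size=30, advance=30))
```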

(Image: tumbling, hopping and session windows)

I can create a simple table definition, like the number of tweets by location for each tumbling window, with

create table rm.tweets_by_location \  
as \  
select user_location, \  
count(Id) nr_of_tweets \  
from twitter_with_key_and_timestamp \  
window tumbling (size 30 seconds) \  
group by user_location  

the output looks like

(Image: tweets_by_location query output)

As you can see, the KEY of the table contains both the user_location and the window timestamp (e.g. Colombes : Window{start=1507016760000 end=-}).

An example of a hopping window can be created with a similar query

create table rm.tweets_by_location_hopping \  
as \  
select user_location, \  
count(Id) nr_of_tweets \  
from twitter_with_key_and_timestamp \  
window hopping (size 30 seconds, advance by 10 seconds) \  
group by user_location;  

With the output being like

(Image: tweets_by_location_hopping query output)

It's interesting to notice that each entry (e.g. Europe North, Switzerland) is listed at least three times. This is due to the fact that at any point in time there are three overlapping windows (SIZE is 30 seconds and ADVANCE is 10 seconds). The same example can be turned into session windows by just defining WINDOW SESSION (30 SECONDS).

Windowing is a useful option, especially when combined with HAVING clauses, since it gives the option to define metrics for real-time analysis.
E.g. I may be interested only in items that have been ordered more than 100 times in the last hour, or, in my Twitter example, in user_locations having a nr_of_tweets greater than 5 in the last 30 minutes.


So far so good: a nice set of SQL functions on top of data coming from a single source (in my case Twitter). In the real world, however, we'll need to mix information coming from disparate sources.... what if I told you that you can achieve that in a single KSQL statement?


To show an integration example, I created a simple topic, known_twitters, using the kafka-console-producer.

./bin/kafka-console-producer --topic known_twitters --broker-list myserver:9092

Once started, I can type in messages and they will be stored in the known_twitters topic. For this example I'll insert the Twitter handle and real name of known people that are talking about OOW. The format will be:



FTisiot,Francesco Tisiot  
Nephentur,Christian Berg  

Once the rows are inserted with the producer, I'm then able to create a KSQL stream on top of the topic with the following syntax (note the VALUE_FORMAT='DELIMITED')

create stream people_known_stream (\  
screen_name VARCHAR, \  
real_name VARCHAR) \  
WITH (\  
KAFKA_TOPIC='known_twitters', \  
VALUE_FORMAT='DELIMITED');

I can now join this stream with the other streams or tables built previously. However, when trying the following statement

select user_screen_name from rm.tweets_by_users a join PEOPLE_KNOWN_STREAM b on a.user_screen_name=b.screen_name;  

I get a nice error

Unsupported join logical node: Left: io.confluent.ksql.planner.plan.StructuredDataSourceNode@6ceba9de , Right: io.confluent.ksql.planner.plan.StructuredDataSourceNode@69518572  

This is due to the fact that as of now KSQL supports only joins between a stream and a table, and the stream needs to be specified first in the KSQL query. If I then just swap the two sources in the select statement above:

select user_screen_name from PEOPLE_KNOWN_STREAM a join rm.tweets_by_users b on a.screen_name=b.user_screen_name;  

...I get another error

Join type is not supportd yet: INNER  

We have to remember that KSQL is still in developer beta phase, a lot of new features will be included before the official release.

Adding a LEFT JOIN clause (see the related bug) solves the issue and I should be able to see the combined data. However, when running

select * from PEOPLE_KNOWN_STREAM left join TWEETS_BY_USERS on screen_name=user_screen_name;  

I didn't retrieve any rows. After adding a proper KEY to the stream definition

as select screen_name , \  
real_name from  people_known_stream \  
PARTITION BY screen_name;  

I was able to retrieve the correct rowset! Again, we are in the early stages of KSQL; these fixes will be enhanced or better documented in future releases!


As we saw in this small example, all transformations, summaries and data enrichments were done directly in Kafka, with a dialect that is very easy to learn for anyone already familiar with SQL. All the created streams/tables are stored as Kafka topics, thus the standard connectors can be used for sink integration.

As mentioned above KSQL is still in developer preview but the overall idea is very simple and at the same time powerful. If you want to learn more check out the Confluent page and the KSQL github repository!

Categories: BI & Warehousing

ODC Appreciation Day: OBIEE's Time Hierarchies

Tue, 2017-10-10 01:58

After last year's successful OTN Appreciation Day, it's time again to show our love for a particular feature in the Oracle tools we use in our work. You may have noted a name change, with OTN now becoming ODC: Oracle Developer Community.


The feature I want to speak about is OBIEE's Time Hierarchies.
For anybody in the BI business, the time dimension(s) are the essence of the intelligence bit: being able to analyze trends, compare the current period with the previous one, and plot year-to-date or rolling measures are just some of the requirements we get on a daily basis.
A time hierarchy definition allows the administrator to set which time levels are exposed, how the rollup/drill down works and how previous/following members of the level are calculated.
Once the hierarchy is defined, all the related calculations are as simple as calling a function (e.g. AGO), defining the level of detail necessary (e.g. Month) and the number of items to take into account (e.g. -1).

A Time hierarchy definition is necessary in the following cases:

  • Time comparisons - e.g. current vs previous month
  • Time related rollups - e.g. Year to date
  • Drill path definition - e.g. Year-Month-Day
  • Fact Tables at various level of details - e.g. daily fact table and monthly pre-aggregated rollup
  • Time related level based measures - e.g. monthly sum of sales coming from a fact table at daily level

Why do I like time hierarchies? Simple! It's a very clever concept in the RPD, one which requires particular knowledge and dedicated attention.

If done right, once defined, a time hierarchy is available in every related table and makes time-comparison formulas easy to understand and create. If done wrong, errors or slowness in the related analyses can be difficult to spot and fix.

Still, time hierarchies are a central piece of every BI implementation, so once understood and implemented correctly they give a massive benefit to all developers.


We blogged about time dimensions and calculations back in 2007, when OBI was still on version 10! The original functionality is still there and the process to follow is pretty much the same.
More recently the concept of the Logical Sequence Number was introduced: a way of speeding up some time-series calculations by removing the ranking operations needed to move back (or forth) in history.

 OBIEE's Time Hierarchies

I wanted to keep the blog post short, since time hierarchy information can be found in millions of blog posts. I just want to give a few hints to follow when creating a time hierarchy:

  • It can be created on any data with a predefined order, it doesn't need to be a date! You could, for example, compare a certain product with the one having the previous code in the inventory.
  • The Chronological Key defines the sorting of the level, for example how years, months or dates are ordered. Ordering months alphabetically is correct with a format like YYYY-MM, while using MM-YYYY provides wrong results.
  • Double check the hierarchies: something like YEAR -> MONTH -> WEEK -> DATE can be incorrect, since a week can be split across different months!
  • Set the number of elements appropriately for each level. This is useful, especially when the hierarchy is complex or pre-aggregated fact tables are present, for OBIEE to understand which table to query depending on the level of the analysis.
  • Set up the Logical Sequence Number. LSNs are useful if you are looking to reduce the impact of time-series processing to a minimum.
  • If you are looking for optimal performance for a specific report, e.g. current year vs previous year, physicalising the time-series result (the previous year) directly in the table alongside the current year will give you what you're looking for.
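The Chronological Key point is easy to demonstrate outside OBIEE: lexicographic ordering matches chronological ordering only when the most significant part of the key comes first.

```python
# Months keyed as YYYY-MM sort correctly as plain strings.
good = ["2017-09", "2016-12", "2017-01", "2017-10"]
print(sorted(good))   # ['2016-12', '2017-01', '2017-09', '2017-10']

# The same months keyed as MM-YYYY sort alphabetically, not chronologically.
bad = ["09-2017", "12-2016", "01-2017", "10-2017"]
print(sorted(bad))    # ['01-2017', '09-2017', '10-2017', '12-2016'] - 12-2016 wrongly ends up last
```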

This was just a quick overview of OBIEE's Time Hierarchies, why they are so useful and what you should look out for when creating them! Hope you found this short post useful.

Follow the #ThanksODC hashtag on Twitter to check which posts have been published on the same theme!

Categories: BI & Warehousing

Unify Update - v1.0.1

Fri, 2017-09-22 07:54
Unify Update - v1.0.1


We have updated Unify following feedback from our customers and have released version 1.0.1. The following bugs have been fixed and features added:

  • Change the default port to 3724 as 8080 is the default port of Oracle XE.
  • Allow port configuration in the desktop app.
  • Fixed problem with date filters not working with >, >=, <, <= operators.
  • Fixed some problems using presentation variables used in filters.
  • Made the preview table scale to the resolution of the screen instead of being fixed size.
  • Enabled parsing for dashboard pages, so an OBIEE page can be opened and each report from it will be loaded into Unify.
  • Made viewing column or filter panes optional in the UI.
  • Improved tray icons for Mac distribution of the Desktop app.
  • Distinguish measures and attributes with icons in the presentation layer.
  • Allow queries from multiple subject areas.
  • Switched to Tableau WDC 2.0.9 to facilitate compatibility with Tableau 10.0.

You can download Unify from our website: https://unify.ritt.md

Categories: BI & Warehousing

Game of Thrones S07 Last Episode: The Summary

Fri, 2017-09-01 09:36
 The Summary

Watch the Episode First! It's a friendly suggestion...

The final #GoT episode was broadcast last Sunday; now two years of waiting for the next season... How can HBO be so cruel??? And how can I find interesting content for my future blog posts???
At least now European football (not soccer) leagues are back, so TV-side I'm covered!


Going back to serious discussions, Game of Thrones' last episode: yay or nay? The average sentiment for the episode (taking into account only tweets since Monday) was -0.012: negative, but an improvement compared to the two previous episodes (with episode 6 having the most negative sentiment score).

 The Summary

But... hey! What is that line running along the top over time? The line is due to the external R call and the fact that it forces us to include the Tweet Text column in the analysis in order to be evaluated. The sentiment evaluation is applied on ATTR(Tweet Text), which is a kind of SELECT DISTINCT Tweet_Text in Oracle terms. The line on top is drawn because the same Tweet Text was tweeted across several weeks.

 The Summary

Please notice that the three overall sentiments are close (between 0.01 and 0.10), so when looking in detail at the distribution of sentiment scores across the episodes we can see that, as expected, they are similar.

 The Summary

Zooming in to single characters, we can see the scatterplot for the last episode, with Jon Snow (or should I say Targaryen?) leading the number of mentions and, surprisingly, Littlefinger in second spot and Arya in third: probably the Baelish death scene at Winterfell was something highly appreciated by the fans.

 The Summary

On the positive/negative axis almost nothing changed, with Arya and the Night King remaining the negative and positive poles. So far I've described changes of leadership on the various axes of the scatterplot by visually comparing today's scatterplot with the previous two. However, the transition of a character's position in the graph can also be visualized on multiple scatterplots.

 The Summary

By creating a scatterplot for each character and assigning the episodes a sequence number (E05-1, E06-2, E07-3) I can clearly see how Davos Seaworth, for example, had a big sentiment variation, going very positive in the last episode, while Jaime Lannister was more stable. Zooming in on Davos' position we can see how the sentiment distribution changed across episodes, with E06 the most negative while E07 has almost all positive tweets.

 The Summary

Looking at the words composing Davos' tweets we can immediately spot a few things:

 The Summary

  • SIR has a positive sentiment (Sir Davos is how several characters call him), which is driving the overall score in the final episode
  • The number of tweets mentioning Davos was very small in E06 compared to the other two (we can see the same from the related scatterplot above)
  • In E07 we see a good number of circles having the same (big) size: possibly the same text tweeted several times.

To verify the last point we can simply show the Tweet Text alongside the # of Tweets and discover that the same positive text accounts for over 99% of all references to the character.
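That kind of check — what share of a character's mentions comes from a single repeated text — is straightforward to express; a toy sketch with invented data:

```python
from collections import Counter

# Invented data: one retweeted text dominating the mentions of a character.
tweets = ["Sir Davos for the throne!"] * 99 + ["Davos was fine I guess"]

# Most common text and its share of all mentions.
text, count = Counter(tweets).most_common(1)[0]
print(text, count / len(tweets))  # Sir Davos for the throne! 0.99
```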

 The Summary


One of the cool functions of the Syuzhet package is named get_nrc_sentiment; it extracts emotions from a text based on the NRC emotion lexicon. The function takes a text as input and returns a data frame containing a row for each sentence and a column for each emotion and sentiment.
The sentiment can either be positive or negative, which we already discussed at length previously. The emotion is split into eight categories: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust.

We can extract the eight different emotions into eight calculations with the following code
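The original calculation (a Tableau SCRIPT_INT call into R's get_nrc_sentiment, shown as an image in the original post) isn't reproduced here; the idea can be sketched in Python with a toy NRC-style lexicon — the words and their emotion mappings below are invented for illustration:

```python
# Toy NRC-style lexicon: each word maps to the set of emotions it evokes (illustrative only).
LEXICON = {
    "murder":  {"anger", "fear", "sadness"},
    "sir":     {"trust"},
    "treason": {"surprise", "fear"},
}

EMOTIONS = ["anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust"]

def nrc_scores(text):
    """Count, per emotion, how many words of `text` evoke it."""
    words = text.lower().split()
    return {e: sum(1 for w in words if e in LEXICON.get(w, set())) for e in EMOTIONS}

print(nrc_scores("sir davos murder treason"))
```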


To calculate the Anger Emotion Score we pass ATTR(Text), the list of tweets' texts, and take the output of the anger column of the data frame. We can do the same for all the other emotions and create separate graphs showing their average across characters for the last episode. In this case I took Disgust, Anger, Fear, Joy and Trust.

 The Summary

We can then clearly see that Bran Stark is the character with the most Disgust associated with him. Bronn has a special mix of emotions: he's near the top for Anger, Fear and Joy, a mix that justifies an average sentiment close to neutral (see the scatterplot above). On the Trust side we can clearly see that the North wins, with Arya and Sansa on top; it's also interesting to see Lord Varys there.
Looking into the detail of Bran's Disgust score, we can see it is driven by the categorization of the word BRAN as disgusting; probably the dictionary doesn't like cereals.

 The Summary

Scene Emotions

In my previous post I talked about the "Game of Couples" and how a single character's sentiment score could be impacted by a reference to a second character. For the last episode of the series I wanted to look at specific scenes: the main characters I want to analyse are Jon Snow, Littlefinger and Sansa. Specifically, I want to understand how people on Twitter reacted to the scenes where these characters had a big impact: the death of Littlefinger, declared by Sansa, and the revelation that Jon is a Targaryen.

The first thing I wanted to check is Surprise: how are characters categorized by this emotion? We can see Bronn on top, driven by the word GOOD in the related tweets.

 The Summary

We can also notice that Petyr's score is quite high (0.2590, 2nd position) while Jon's score is pretty low, probably averaged down by the huge number of tweets. Sansa's score is also not very high, even though she is the character providing quite a big shock when accusing Littlefinger.

The overall character average surprise doesn't seem to be very relevant, so we need a way to filter tweets related to those particular scenes: we can do that by keeping only tweets whose text contains a few keywords. Please note that the keyword filter creates an OR condition: if a tweet contains ANY of the words mentioned, it will be included.
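The OR semantics of that keyword filter can be sketched as follows (function name invented):

```python
def matches_any(text, keywords):
    """True if the tweet text contains at least one of the keywords (case-insensitive)."""
    t = text.upper()
    return any(k.upper() in t for k in keywords)

print(matches_any("Jon is secretly a Targaryen!", ["TARGARYEN", "SON"]))  # True
print(matches_any("Winter is coming", ["TARGARYEN", "SON"]))              # False
```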

First I wanted to check which words in Jon's tweets drive the Surprise sentiment, alongside the # of Tweets.

 The Summary

However, this only gives us details on which words are classified as Surprise for Jon, nothing really related to the scenes. I can, however, filter only the tweets with an overall Surprise sentiment for Jon and check which words are most associated with them. I also added a filter for tweets containing the words TARGARYEN OR SON, since I assumed those two would be frequently used to describe the scene.

 The Summary

We can clearly see some patterns that are correctly recognized by the Surprise metric: both Aegon (a reference to Jon's real name) and Aunt (a reference to Lyanna or Daenerys?) are in the top 20, and a little further right in the graph we can also spot Father. There is probably also some surprise in tweets about what's going to happen when Jon finds out he's a Targaryen, since all the related keywords are present in the top 20.

When doing a similar analysis on Sansa I wanted to add another metric to the picture: the Average Sentence Emotion Score for all sentences containing a given word. With this metric we can see how a word (for example AMAZING) changes the average emotion of the sentences in which it's included. Analysing this metric alone, however, wouldn't be useful: obviously the words having the most impact on an emotion are the ones categorized as such in the related dictionary.

I found the following view interesting for Sansa: across all the tweets categorized as Surprising, it shows which words are most mentioned (Y-axis) and the average Surprise emotion value for the sentences where those words were included.

 The Summary

We can spot that MURDER and TREASON were included in a big number of tweets (>500) with an average Surprise score around 2. This seems to indicate that the scene of Sansa convicting Lord Baelish wasn't expected by the fans.

One last graph shows how the character couples (remember the game of couples in my previous post?) have been perceived: the square colour defines the average Surprise score, while the position on the X-axis gives the confidence (the # of Tweets).

 The Summary

We can spot that the couple Cersei and Sansa is the one with the most Surprise emotion, followed by Cersei and Daenerys. Those two couples may be expected, since both characters had major parts in the last episode. Something unexpected is the couple Sandor Clegane and Brienne: looking in detail, the surprise is driven by mentions of the word MURDER, which is included in 57.76% of the tweets mentioning both.

 The Summary

A last technical note: during the last few weeks I've collected about 700 thousand tweets, and the time needed to analyse them depends heavily on the complexity of the query. For simple counts or sums based only on BigQuery data I could obtain replies in a few seconds. For other analyses, especially when sentiment or emotion was included, a big portion of the raw dataset was retrieved from BigQuery into Tableau, passed to R, and the function results moved back to Tableau to be displayed. Those queries could take minutes to be evaluated.
As written in my previous blog post, the whole process could only be sped up by pre-processing the data and storing the sentiment/emotion in BigQuery alongside the data.

 The Summary

This concludes my series of blog posts about Game of Thrones tweet and press analysis with Kafka, BigQuery and Tableau! See you in two years for the analysis of the next season, probably with a whole new set of technology!


Categories: BI & Warehousing

The Week After: Game of Thrones S07 E06 Tweets and Press Reviews Analysis

Fri, 2017-08-25 09:56
 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

Another week is gone, another "Game of Thrones" episode watched, only one left until the end of the 7th season.
The "incident" in Spain, with the episode released for a few hours on Wednesday, screwed up all my plans for a time-wise comparison between episodes across several countries.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

I was then forced to think up a new action plan in order to avoid disappointing all the fans who enjoyed my previous blog post about episode 5. What you'll read in today's analysis is based on the same technology as before: a Kafka Connect source from Twitter and sink to BigQuery, with Tableau analysis on top.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

What I changed in the meantime is the data structure setup: in the previous part there was a BigQuery table rm_got containing #GoT tweets and an Excel table containing Keywords for each character together with the Name and the Family (or House). Finally, there was a view on top of the BigQuery rm_got table extracting all the words of each tweet in order to analyse their sentiment.
For this week's analysis I tried to optimise the dataflow, mainly by pushing data into BigQuery, and I added a new part to it: online press review analysis!


As mentioned in my previous post, the setup described before mimicked an analyst workflow without write access to the data source. However, it was far from optimal performance-wise, since there was a cartesian join between two data sources: for every query the whole dataset was extracted from BigQuery and then joined in memory in Tableau, even when filters for a specific character were included.

The first change was pushing the characters' Excel data into BigQuery, so we could at least use joins in the same data source instead of relying on Tableau's data blending. This has the immediate benefit of running joins and filters in the data source rather than retrieving all data and filtering locally in memory.
Pushing Excel data into BigQuery is really easy and can be done directly in the web GUI; we just need to transform the data into CSV, which is one of the allowed input formats.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

Still, this modification alone doesn't resolve the problem of the cartesian join between characters (stored in rm_characters) and the main rm_got table, since BigQuery's native join conditions don't allow the CONTAINS function we would need to verify that the character Key is contained in the Tweet's text.
Luckily, I already had the rm_words view, used in the previous post, splitting the words contained in the tweet text into multiple rows. The view contains the Tweet's Id and can be joined with the characters data using an = condition.

However, my over-simplistic first implementation of the view removed only the # and @ characters from the tweet text, leaving all the other punctuation signs in the words, as you can see in the image below.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

I replaced the old rm_words view code with the following:

SELECT id, TEXT,
  SPLIT(REGEXP_REPLACE(REPLACE(UPPER(TEXT), 'NIGHT KING', 'NIGHTKING'), '[^a-zA-Z]', ' '), ' ') f0__group.word
FROM [big-query-ftisiot:BigQueryFtisiotDataset.rm_got]

Which has two benefits:

  • REPLACE(UPPER(TEXT), 'NIGHT KING', 'NIGHTKING'): since I'm splitting on words, I don't want to miss references to the Night King, whose name is composed of two words that, even if written separately, point to the same character.
  • REGEXP_REPLACE(..,'[^a-zA-Z]',' '): replaces using a regular expression, removing every character apart from the letters A-Z (in lower and upper case) from the tweet text.
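The behaviour of those two transformations can be checked with an equivalent Python sketch (function name invented):

```python
import re

def clean_words(text):
    # Collapse the two-word "NIGHT KING" into a single token, then keep letters only.
    t = text.upper().replace("NIGHT KING", "NIGHTKING")
    return re.sub(r"[^a-zA-Z]", " ", t).split()

print(clean_words("The Night King rules #GoT"))  # ['THE', 'NIGHTKING', 'RULES', 'GOT']
```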

The new view definition provides a clean set of words that can finally be joined with the list of character Keys. The last step I did to prepare the data was to create a single view containing all the fields I was interested in for my analysis, with the following join logic:

  [DataSet.rm_got] AS rm_got JOIN
  [DataSet.rm_words] AS rm_words ON rm_got.id=rm_words.id JOIN 
  (SELECT * FROM FLATTEN([DataSet.rm_words],f0__group.word)) AS rm_words_char ON rm_got.id=rm_words_char.id JOIN 
  [DataSet.rm_charachters] AS characters ON rm_words_char.f0__group.word = characters.Key

Two things to notice:

  • The view rm_words is used twice: once, as mentioned before, to join the tweet with the character data, and once to show all the words contained in a tweet.
  • The (SELECT * FROM FLATTEN([DataSet.rm_words],f0__group.word)) subselect is required since the word column contained in rm_words is a repeated field, which can't be used in a join condition unless flattened.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

Please note that the SQL above will still duplicate the tweet rows: in reality we'll have a row for each word and each different character Key contained in the text itself. Still, this is a big improvement over the cartesian join we used in our first attempt.

One last mention of optimizations: currently the sentence and word sentiment is calculated on the fly in Tableau using the SCRIPT_INT function. This means that data is extracted from BigQuery into Tableau, then passed to R (running locally on my PC), which computes the score and returns it to Tableau. In order to optimize Tableau performance I could pre-compute the scores in R and push them into a BigQuery table, but this would mean a pre-processing step that I wanted to avoid, since real-time analysis was one of my purposes.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

Tweet Analysis

With my tidy dataset in place, I can now start the analysis and, as in the previous week, track various KPIs like the mentions by character Family and Name. To filter only the current week's data I created two parameters, Start Date of Analysis and End Date of Analysis.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

Using those parameters I can filter which days I want to include in my analysis. To apply the filter in the Workbook/Dashboard I also created a column Is Date of Analysis with the following formula:

IIF(DATE([CreatedAt]) >= [Start Date of Analysis]
AND DATE([CreatedAt]) <= [End Date of Analysis],
'Yes', 'No')

I can now use the Is Date of Analysis column in my Workbooks and filter the Yes value to retain only the selected dates.
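The same inclusive date-range check, outside Tableau, looks like this (function name mirrors the column above and is illustrative):

```python
from datetime import date, datetime

def is_date_of_analysis(created_at, start, end):
    """'Yes' when the tweet's creation date falls inside the inclusive [start, end] range."""
    return "Yes" if start <= created_at.date() <= end else "No"

print(is_date_of_analysis(datetime(2017, 8, 21, 10, 30), date(2017, 8, 21), date(2017, 8, 27)))  # Yes
print(is_date_of_analysis(datetime(2017, 8, 28, 0, 1), date(2017, 8, 21), date(2017, 8, 27)))    # No
```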

I built a dashboard containing a few of the analyses mentioned in my previous blog post, in which I can see the overall scatterplot of characters by # of Tweets and Sentence Sentiment and click on one of them to check the details of the most common words used and the sentence sentiment.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

From the scatterplot on top we can see a change of leadership in the # of Tweets, with Daenerys overtaking Jon by a good margin: saving him while losing one of her three dragons was a touching moment in the episode. When clicking on Daenerys we can see that the word WHITE is also driving the positive sentiment.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

The Night King keeps his leadership on the positive sentiment side, and also in this case the word WHITE is the most used with positive sentiment. On the other side, Arya overtook Sansa as the character with the most negative mentions. Going into detail on the positive/negative words, we can clearly see that STARK (mentioned in the previous episode), KILL, WRONG and DEATH are driving the negative sentiment. Also interesting is the word WEAR with negative sentiment (from the Google dictionary: "damage, erode, or destroy by friction or use.").

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

A cut down version of the workbook with a limited dataset, visible in the image below, is available in Tableau Public.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

Game of Couples

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

This comparison is all I promised towards the end of my first post, so I could easily stop here. However, as a curious person and a #GoT fan myself, I wanted to know more about the dataset and in particular analyse how character interactions affect sentiment. To do so I somehow had to join characters together when they were mentioned in the same tweet; luckily, my dataset contained both the character mentioned and the list of words of each tweet, so I could reuse the list of words in a left join with the list of character Keys. In this way I have a record for each couple of characters mentioned in a tweet.
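The couple generation can be sketched as follows: intersect a tweet's words with the character Keys, then take every pair (the character names are examples):

```python
from itertools import combinations

# Illustrative subset of the character Keys.
CHARACTER_KEYS = {"ARYA", "SANSA", "JON", "DAENERYS", "BRAN"}

def couples(tweet_words):
    """All pairs of characters mentioned together in one tweet."""
    mentioned = sorted(set(tweet_words) & CHARACTER_KEYS)
    return list(combinations(mentioned, 2))

print(couples(["JON", "AND", "DAENERYS", "FLY", "NORTH"]))  # [('DAENERYS', 'JON')]
```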

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

I can then start analysing the tweets mentioning any couple of characters, with the # of Tweets driving the colour gradient. As you can see, I removed the cells where the column and row are equal (e.g. Arya and Arya). The result, as expected, is a symmetric matrix, since the # of Tweets mentioning Arya and Sansa is the same as the # mentioning Sansa and Arya.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

We can clearly see that Jon and Daenerys are the most mentioned couple, with Sansa and Arya following and, in third place, the Whitewalkers and Bran. This view and its insights could be problematic when the reader is colour-blind or has trouble judging intensity. For those cases a view like the one below provides the same information (by simply switching the # of Tweets column from Colour to Size); however, it has the drawback that small squares are hard to see.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

The next step in my "couple analysis" is understanding sentiment, and how a second character mentioned in the same tweet affects the positive/negative score of a character. The first step was showing the same scatterplot as before, but filtered for a single character, in this case Arya.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

The graph shows Arya's original position, and how the sentiment and # of Tweets change her position when another character is included in the tweet. We can see that when she is mentioned with Daenerys the sentiment is much more positive, while when mentioned with Bran or Littlefinger the sentiment remains almost the same.

This graph is very easy to read; however, it has the limitation of being able to display only one character's behaviour at a time (in this case Arya's). What I wanted was to show the same pattern across all characters, in a similar way to the # of Tweets per couple analysis. To do so I went back to a matrix-style visualization, setting the colour based on positive (green) or negative (red) sentiment.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

As before, the matrix is symmetric and provides us with a new set of insights. For example, when analysing Jorah Mormont, we can see that a mention together with Cersei is negative, which we can somewhat expect due to the nature of the queen. What's strange is that there is also a negative feeling when Jorah is mentioned with Samwell Tarly. Looking deeper into the data, we can see that it's due to a single tweet containing both names with a negative sentiment score.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

What's missing in the above visualization is an indication of how "strong" the relationship between two characters is, based on the # of Tweets where they are mentioned together. We can add this by using the # of Tweets as the position of the sentiment square: the further the square is moved towards the right, the higher the # of Tweets mentioning the two characters together.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

We can see, as before, that Jorah and Sam have a negative feeling when mentioned together, but it's not statistically significant because the # of Tweets is very limited (square position completely on the left). Another example is Daenerys and Jon, who have a lot of mentions together with a neutral sentiment. As we saw before, the couple Arya and Bran also have a negative feeling when mentioned together, with a limited number of tweets mentioning them; Bran mentioned with the Whitewalkers, however, has a strong positive sentiment.

It's worth mentioning that the positioning of the square is based on a uniform scale across the whole matrix. This means that if, as in our case, there is a dominant couple (Daenerys and Jon) mentioned by an order of magnitude more tweets than all other couples, the difference in positioning of all the other squares will be minimal. This could, however, be solved using a logarithmic scale.

Web Scraping

Warning: all the analyses in this article, including this chapter, are performed with automated tools. Due to the nature of the subject (a TV series full of deaths, battles and thrilling scenes), the words used to describe a scene can be automatically classified as positive/negative. This doesn't necessarily mean that the opinion of the writer is either positive or negative about the scene/episode/series.

The last part of the analysis I had in mind was comparing the tweets' sentiment with the sentiment of the episode reviews I could find online. This latter part relies heavily on R to scrape the relevant bits from the web pages; the whole process was:

  • Search on Google for Beyond the Wall Reviews
  • Take the top N results
  • Scrape the review from the webpage
  • Tokenize the review in sentences
  • Assign the sentence score using the same method as in Tableau
  • Tokenize the sentence in words
  • Upload the data into BigQuery for further analysis

A few notes on the solution I used to accomplish this: since the reviews come from different websites with different tags, classes and Ids, I wasn't able to write a general scraper for all websites. However, each review webpage I found had the main text divided into multiple <p> tags under a main <div> tag with a unique Id or class. The R code simply lists the <div> elements, finds the one with the correct Id or class and takes all the text contained inside its <p> elements. A single function is called with three parameters: the website, the SourceName (e.g. Ign) and the Id or class to look for. The call to the function looks like:

sentence_df <- scrapedata("http://www.ign.com/articles/2017/08/21/game-of-thrones-beyond-the-wall-review",'Ign',"article-content")  

It returns a dataframe containing one row per <p> tag, together with a mention of the source (Ign in this case).
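The original R scraper isn't shown in full; the same idea can be sketched in Python using only the standard library (class and function names invented):

```python
from html.parser import HTMLParser

class DivPExtractor(HTMLParser):
    """Collects the text of <p> tags found inside the <div> with a given id or class."""
    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.depth = 0       # nesting depth while inside the target <div>
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            inside = self.depth > 0
            matches = attrs.get("id") == self.marker or self.marker in (attrs.get("class") or "").split()
            if inside or matches:
                self.depth += 1
        elif tag == "p" and self.depth:
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

def scrapedata(html, marker):
    parser = DivPExtractor(marker)
    parser.feed(html)
    return parser.paragraphs

html = '<div id="article-content"><p>First.</p><p>Second.</p></div><div><p>skip</p></div>'
print(scrapedata(html, "article-content"))  # ['First.', 'Second.']
```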

The rest of the R code tokenizes the sentences and words using the tokenizers package and assigns the related sentiment score with the syuzhet package used in my previous blog post. Finally, it creates a newline-delimited JSON file, one of the input formats accepted by BigQuery.
Once the data is in BigQuery, the analysis follows the same approach as before, with Tableau connecting directly to BigQuery and again using R for word sentiment scoring.
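Producing a newline-delimited JSON file for BigQuery is one JSON object per line; a minimal sketch (field names invented):

```python
import json

rows = [
    {"source": "Ign", "sentence": "A stellar episode.", "score": 1},
    {"source": "Telegraph", "sentence": "A slow start.", "score": -1},
]

# BigQuery's "JSON (newline delimited)" format: one JSON object per line.
with open("reviews.json", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```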

The overall result in Tableau includes a global episode sentiment score by Source, the usual scatterplot by character, and the same by Source. Each of the visualizations can act as a filter for the others.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

We can clearly see that AVClub and Indiewire had opposite feelings about the episode. Jon Snow is the most mentioned character, with Arya and Sansa overtaking Daenerys.

The AVClub vs Indiewire scoring can be explained by the sentence sentiment categorization: Indiewire had mostly negative sentences (negative evaluations), while AVClub's distribution has its peak on the 1 (positive) value.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

Checking the words used by the two Sources, we notice, as expected, a majority of positive words for AVClub, while Indiewire's overall counts are almost equal.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

Going into detail on the words, we can see AVClub's positive sentiment being driven by ACTION, SENSE and REUNION, while Indiewire's negative one is due to ENEMY, BATTLE and HORROR.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

This is the automated overall sentiment analysis; if we read the two articles from Indiewire and AVClub in detail, we can see that the overall opinion is not far from the automated score:

From AVClub

On the level of spectacle, “Beyond The Wall” is another series high point, with stellar work ....

From Indiewire

Add to the list “Beyond the Wall,” an episode that didn’t have quite the notable body count that some of those other installments did

To be fair, we also need to say that the Indiewire article focuses on the battle and the thrilling scene with the Whitewalkers, where words like ENEMY, COLD, BATTLE and DEATH, which have a negative sentiment, are actually only used to describe the scene and not the feelings about it.

Character and Review Source Analysis

The last piece of analysis relates to single characters. As mentioned before, part of the dashboard built in Tableau included the Character scatterplot and the Source scatterplot. By clicking on a single character I can easily filter the Source scatterplot, in this case for Daenerys.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

We can see how different Sources have a different average sentiment score for the same character, in this case with Mashable being positive while Pastemagazine is negative.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

Checking the words mentioned, we can clearly see a positive sentiment related to PRESENT, AGREED and RIDER for Mashable, while the negative sentiment of Pastemagazine is driven by FIGHT, DANGER and LOSING. As said before, just a few words of difference describing the same scene can make all the difference.

Finally, one last sentence on the very positive sentiment score of Gregor Clegane: it is partially due to the reference to his nickname, the Mountain, which is used as a Key to find references. The mountain is mentioned in a series of sentences referring to the place where the group led by Jon Snow is heading in order to find the Whitewalkers. We could easily remove MOUNTAIN from the Keywords to eliminate the mismatch.

 Game of Thrones S07 E06 Tweets and Press Reviews Analysis

We are at the end of the second post about Game of Thrones analysis with Tableau, BigQuery and Kafka. Hope you didn't get bored...see you next week for the final episode of the series! And please avoid waking up with blue eyes!


Categories: BI & Warehousing

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

Thu, 2017-08-17 10:54
How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

I don't trust statistics and personally believe that at least 74% of them are wrong.... but I bet nearly 100% of people with any interest in fantasy (or just any) TV shows are watching the 7th series of Game of Thrones (GoT) by HBO.
If you are one of those, join me in the analysis of the latest tweets regarding the subject. Please be also aware that, if you're not on the latest episode, some spoilers may be revealed by this article. My suggestion is then to go back and watch the episodes first and then come back here for the analysis!

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

If you aren't part of the above group then ¯\_(ツ)_/¯. Still, this post contains a lot of detail on how to perform analysis on any tweet with Tableau and BigQuery, together with Kafka source and sink configurations. I'll leave it to you to find another topic to put all this into practice.

Overall Setup

As described in my previous post on analysing Wimbledon tweets, I've used Kafka for the tweet extraction phase. In this case however, instead of querying the data directly in Kafka with Presto, I'm landing the data into a Google BigQuery table. The last step is optional, since in the last blog I was directly querying Kafka, but in my opinion it represents the perfect use case for both technologies: Kafka for streaming and BigQuery for storing and querying data.
The endpoint is represented by Tableau, which has a native connector to BigQuery. The following image represents the complete flow

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

One thing to notice: at this point in time I'm using an on-premises installation of Kafka which I kept from my previous blog. However, since source and target are natively cloud applications, I could easily move Kafka to the cloud as well, using for example the recently announced Confluent Kafka as a Service.

Now let's add some details about the overall setup.


For the purpose of this blog post I've switched from the original Apache Kafka distribution to the Confluent open source one. I've chosen the Confluent distribution since it includes Kafka Connect, which is

A framework for scalably and reliably streaming data between Apache Kafka and other data systems

Using this framework anybody can write a connector to push data from any system (Source Connector) to Kafka or pull data from it (Sink Connector). There is a list of available connectors developed and maintained either by Confluent or by the community. Moreover, Kafka Connect provides the benefit of parsing the message body and storing it in Avro format, which makes it easier to access and faster to retrieve.

Kafka Source for Twitter

In order to source from Twitter I've been using this connector. The setup is pretty easy: copy the source folder named kafka-connect-twitter-master under $CONFLUENT_HOME/share/java and modify the file TwitterSourceConnector.properties located under the config subfolder in order to include the connection details and the topics.

The configuration file in my case looked like the following:


# Set these required values (credential placeholders to be replaced with your
# own Twitter application keys)
twitter.oauth.consumerKey=<CONSUMER_KEY>
twitter.oauth.consumerSecret=<CONSUMER_SECRET>
twitter.oauth.accessToken=<ACCESS_TOKEN>
twitter.oauth.accessTokenSecret=<ACCESS_TOKEN_SECRET>

process.deletes=false
kafka.status.topic=rm.got
filter.keywords=#got,gameofthrones,stark,lannister,targaryen

A few things to notice:

  • process.deletes=false: I'll not delete any message from the stream
  • kafka.status.topic=rm.got: I'll write against a topic named rm.got
  • filter.keywords=#got,gameofthrones,stark,lannister,targaryen: I'll take all the tweets with one of the following keywords included. The list could be expanded, this was just a test case.

All the work is done! The next step is to start the Kafka Connect execution via the following call from $CONFLUENT_HOME/share/java/kafka-connect-twitter

$CONFLUENT_HOME/bin/connect-standalone config/connect-avro-docker.properties config/TwitterSourceConnector.properties

I can see the flow of messages in Kafka using the avro-console-consumer command

./bin/kafka-avro-console-consumer --bootstrap-server localhost:9092 --property schema.registry.url=http://localhost:8081 --property print.key=true --topic twitter --from-beginning

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

You can see (or maybe it's a little bit difficult from the GIF) that the message body was transformed from JSON to Avro format; the following is an example

"Text":{"string":"RT @haranatom: Daenerys Targaryen\uD83D\uDE0D https://t.co/EGQRvLEPIM"},
Kafka Sink to BigQuery

Once the data is in Kafka, the next step is to push it to the selected datastore: BigQuery. I can rely on Kafka Connect for this task too, with the related code written and supported by the community and available on GitHub.

All I had to do was download the code and change the file kcbq-connector/quickstart/properties/connector.properties

# The name of the BigQuery project to write to
project=<PROJECT_NAME>
# The name of the BigQuery dataset to write to (leave the '.*=' at the beginning, enter your
# dataset after it)
datasets=.*=<DATASET_NAME>
# The location of a BigQuery service account JSON key file
keyfile=<PATH_TO_KEYFILE>

The changes included:

  • the topic name to source from Kafka
  • the project, dataset and Keyfile which are the connection parameters to BigQuery. Note that the Keyfile is automatically generated when creating a BigQuery service.

After verifying the settings, as per the Kafka Connect instructions, I had to create the tarball of the connector and extract its contents

cd /path/to/kafka-connect-bigquery/  
./gradlew clean confluentTarBall
mkdir bin/jar/ && tar -C bin/jar/ -xf bin/tar/kcbq-connector-*-confluent-dist.tar  

The last step is to launch the connector by moving into the kcbq-connector/quickstart/ subfolder and executing


Note that you may need to specify the CONFLUENT_DIR if the Confluent installation home is not in a sibling directory

export CONFLUENT_DIR=/path/to/confluent  

When everything starts up without any errors, a table named rm_got (the name is automatically generated) appears in the BigQuery dataset I defined previously and starts populating.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

A side note: I encountered a Java Heap Space error during the run of the BigQuery sink. This was resolved by increasing the heap space setting of the connector via the following call

export KAFKA_HEAP_OPTS="-Xms512m -Xmx1g"  

BigQuery, based on Google's Dremel paper, is Google's proposition for an enterprise cloud data warehouse, combining speed and scalability with separate pricing for storage and compute. While the cost of storage is common knowledge in the IT world, the compute cost is a fairly new concept: the cost of the same query can vary depending on how the data is organized. In Oracle terms, we are used to associating the query cost with the one defined in the explain plan. In BigQuery that concept is extended from "performance cost" to the "financial cost" of a query: the more data a single query has to scan, the higher its cost. This makes the work of optimizing data structures visible not only performance-wise but also on the financial side.
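As a back-of-the-envelope sketch of this pricing model (the ~$5/TB on-demand rate was BigQuery's published price at the time of writing and may change; the helper name is mine), the financial cost of a query can be estimated from the bytes it scans:

```python
# Sketch: estimate the on-demand cost of a BigQuery query from bytes scanned.
# The $5/TB default reflects BigQuery on-demand pricing at the time of
# writing; check current pricing before relying on it.

TB = 1024 ** 4  # BigQuery bills against binary terabytes (tebibytes)

def estimate_query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Return the estimated cost in USD for a query scanning `bytes_scanned`."""
    return (bytes_scanned / TB) * usd_per_tb

# A query forced to scan a 250 GB column costs far more than one scanning
# 1 GB, even if both return a single aggregated row.
print(estimate_query_cost(250 * 1024 ** 3))  # 250 GB scan
print(estimate_query_cost(1 * 1024 ** 3))    # 1 GB scan
```

This is why pruning the columns a query touches pays off twice in BigQuery: once in speed and once on the invoice.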

For the purpose of this blog post, I had almost no settings to configure other than creating a Google Cloud Platform account, a BigQuery project and a dataset.

During the Project creation phase, a Keyfile is generated and stored locally on the computer. This file contains all the credentials needed to connect to BigQuery from any external application, my suggestion is to store it in a secure place.

{
  "type": "service_account",
  "project_id": "<PROJECT_ID>",
  "private_key_id": "<PROJECT_KEY_ID>",
  "private_key": "<PRIVATE_KEY>",
  "client_email": "<E-MAIL>",
  "client_id": "<ID>",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "<URL>"
}

This file is used in the Kafka sink as we saw above.


Once the data has landed in BigQuery, it's time to analyse it with Tableau!
The connection is really simple: from the Tableau home I just need to select Connect -> To a Server -> Google BigQuery, fill in the connection details and select the project and datasource.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

An important setting is the Use Legacy SQL checkbox in the datasource definition. Without this setting checked I wasn't able to properly query the BigQuery datasource. This is due to the fact that "Standard SQL" doesn't support nested columns while Legacy SQL (also known as BigQuery SQL) does; for more info check the related Tableau documentation.

Analysing the data

Now the fun part starts: analysing the data! The integration between Tableau and BigQuery automatically exposes all the columns of the selected tables together with the correctly mapped datatypes, so I can immediately start playing with the dataset without having to worry about datatype conversions or date formats. I can simply include in the analysis the CreatedAt date and the Number of Records measure (named # of Tweets) and display the number of tweets over time.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

Now I want to analyse where the tweets are coming from. I can use either the Place.Country or the Geolocation.Latitude and Geolocation.Longitude fields in the tweet detail. Latitude and Longitude are more detailed while the Country is rolled up at country level, but both solutions have the same problem: they are available only for tweets with geolocation activated.

After adding Place.Country and # of Tweets in the canvas, I can then select the map as visualization. Two columns Latitude (generated) and Longitude (generated) are created on the fly mapping the country locations and the selected visualization is shown.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

However, as mentioned before, this map shows only a subset of the tweets, since the majority of tweets (almost 99%) have no location.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

The fields User.Location and User.TimeZone suffer from a different problem: they are either null or their values don't come from a predefined list but are left to the creativity of the account owner, who can type any string. As you can see, it seems we have some tweets coming directly from Winterfell, Westeros, and interestingly enough... Hogwarts!

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

Checking the most engaged accounts based on the User.Name field clearly shows that Daenerys and Jon Snow take the time to tweet between fighting Cersei and the Whitewalkers.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

The field User.Lang can be used to identify the language of the user. However, when analysing the raw data, you can notice language splits for regional language settings (note en vs en-gb). We can solve the problem by creating a new field User.Lang.Clean taking only the first part of the string, with a formula like

IF  FIND([User.Lang],'-') = 0  
    THEN [User.Lang] 
    ELSE LEFT([User.Lang], FIND([User.Lang],'-') - 1) 
END  
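The same normalization can be sketched in Python (the helper name is mine):

```python
def clean_lang(user_lang: str) -> str:
    """Strip the regional suffix from a language code, e.g. 'en-gb' -> 'en'."""
    # split on the first '-' only; codes without a '-' pass through unchanged
    return user_lang.split('-', 1)[0]

print(clean_lang('en-gb'))  # en
print(clean_lang('pt'))     # pt
```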

With the interesting result of Italian being the 4th most used language, overtaking Portuguese, showing the high interest in the show in my home country.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

Character and House Analysis

Still with me? So far we've done some pretty basic analysis on top of pre-built fields or with little transformations... now it's time to go deep into the tweet's Text field and check what the people are talking about!

The first thing I wanted to do was check mentions of the characters and related houses. The more a house is mentioned, the more relevant it should be, correct?
The first text analysis I wanted to perform was the Stark vs Targaryen mention war: showing how many tweets mentioned both, only one, or neither of the two main houses. I achieved it with the IF statement below

IF contains(upper([Text]), 'STARK') AND contains(upper([Text]),'TARGARYEN')  
 THEN 'Both' 
 ELSEIF contains(upper([Text]), 'STARK') 
  THEN 'Stark' 
 ELSEIF contains(upper([Text]), 'TARGARYEN') 
  THEN 'Targaryen' 
 ELSE 'None' 
END
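The same classification logic can be sketched in Python (the function name is mine):

```python
def classify_mentions(text: str) -> str:
    """Tag a tweet by house mentions, mirroring the Tableau IF statement."""
    t = text.upper()
    stark = 'STARK' in t
    targaryen = 'TARGARYEN' in t
    if stark and targaryen:
        return 'Both'
    elif stark:
        return 'Stark'
    elif targaryen:
        return 'Targaryen'
    return 'None'

print(classify_mentions('Stark and Targaryen unite!'))  # Both
print(classify_mentions('Winter is coming'))            # None
```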

With the results supporting the house Stark

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

I can do the same at single character level counting the mentions on separate columns like for Jon Snow

IIF(contains(upper([Text]), 'JON')  
OR contains(upper([Text]),'SNOW'), 1,0)  

Note the OR condition: I want to count as mentions both the words JON and SNOW, since those uniquely refer to the same character. Similarly, I can create a column counting the mentions of Arya Stark with the following formula

IIF(contains(upper([Text]), 'ARYA'), 1,0)  

Note that in this case I'm filtering only on the name (ARYA), since Stark could refer to multiple characters (Sansa, Bran ...). I created several columns like the two above for some characters and displayed them in a histogram ordered by # of Mentions in descending order.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

As expected, after looking at the Houses results above, Jon Snow is leading the user mentions with a big margin over the others with Daenerys in second place.

The methods mentioned above however have some big limitations:

  • I need to create a different column for every character/house I want to analyse
  • The formula complexity increases if I want to analyse more houses/characters at the same time

My goal would be to have an Excel file, where I set the research Key (like JON and SNOW) together with the related character and house and mash this data with the BigQuery table.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

The joining key would be like

CONTAINS([BigQuery].[Text], [Excel].[Key]) >0  

Unfortunately Tableau allows only the = operator in join conditions during data blending, making the above syntax impossible to implement. I now have three options:

  • Give Up: Never if there is still hope!
  • Move the Excel into a BigQuery table and resolve the problem there by writing a view on top of the data: works but increases the complexity on BigQuery side, plus most Tableau users will not have write access to related datasources.
  • Find an alternative way of joining the data: If the CONTAINS join is not possible during data-blending phase, I may use it a little bit later...

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

Warning: the method described below is not optimal performance-wise and should be used carefully, since it causes data duplication if not handled properly.

Without the option of using CONTAINS, I had to create a cartesian join during the data-blending phase. With a cartesian join every row in the BigQuery table is repeated for every row in the Excel table. I created the cartesian join by simply putting a 1=1 condition in the data-blending section.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

I can then apply a filter on the resulting dataset to keep only the BigQuery rows mentioning one (or more) Key from the Excel file, with a formula like the following:

IF CONTAINS(UPPER([Text]), [Key]) THEN [Id] END
This formula keeps the tweet Id where the Excel [Key] field is contained in the UPPER([Text]) coming from Twitter. Since there are multiple Keys assigned to the same character/house (see Jon Snow with both keywords JON and SNOW), the aggregation for this column is a distinct count, which in Tableau is achieved with the COUNTD formula.
I can now simply drag the Name from the Excel file and the # of Mentions column with the above formula, setting the aggregation method to count distinct.
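The whole cartesian-join-plus-CONTAINS-plus-COUNTD pipeline can be sketched in plain Python (data and names are illustrative):

```python
# Sketch of the cartesian-join workaround: pair every tweet with every lookup
# key, keep the matches, and count distinct tweet Ids per character
# (Tableau's COUNTD). Sample data stands in for BigQuery and the Excel file.

tweets = [
    (1, 'Jon is the king in the north'),
    (2, 'SNOW knows nothing'),
    (3, 'Arya is deadly'),
]
keys = [  # (Key, Name) rows from the Excel lookup file
    ('JON', 'Jon Snow'),
    ('SNOW', 'Jon Snow'),
    ('ARYA', 'Arya Stark'),
]

mentions = {}
for tweet_id, text in tweets:          # cartesian join: tweets x keys
    for key, name in keys:
        if key in text.upper():        # the CONTAINS filter
            mentions.setdefault(name, set()).add(tweet_id)

# COUNTD: distinct tweet Ids, so 'JON' and 'SNOW' in one tweet count once
counts = {name: len(ids) for name, ids in mentions.items()}
print(counts)  # {'Jon Snow': 2, 'Arya Stark': 1}
```

Using a set of tweet Ids per character is exactly why the duplication introduced by the cartesian join doesn't inflate the mention counts.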

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

The beauty of this solution is that now if I need to do the same graph by house, I don't need to create columns with new formulas, but simply remove the Name field and replace it with Family coming from the Excel file.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

Also if I forgot a character or family I simply need to add the relevant rows in the Excel lookup file and reload it, nothing to change in the formulas.

Sentiment Analysis

Another goal I had in mind when analysing GoT data was the sentiment analysis of tweets and the average sentiment associated to a character or house. Doing sentiment analysis in Tableau is not too hard, since we can reuse already existing packages coming from R.

For the Tableau-R integration to work I had to install and run the Rserve package on a workstation where R was already installed, and set up the connection in Tableau. More details on this configuration can be found in the Tableau documentation

Once Tableau is configured to call R functions, it's time to analyse the sentiment. I used the Syuzhet package (previously downloaded) for this purpose. The sentiment calculation is done by the following formula:

SCRIPT_INT("library(syuzhet);  
r<-(get_sentiment(.arg1,method = 'nrc'))",  
ATTR([Text]))


  • SCRIPT_INT: the method will return an integer score for each tweet, with positive sentiments having positive scores and negative sentiments negative scores
  • get_sentiment(.arg1,method = 'nrc'): is the function used
  • ATTR([Text]): the input parameter of the function which is the tweet text
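To build intuition for what a dictionary-based scorer like nrc does, here is a much-simplified Python sketch; the toy lexicon and function are illustrative stand-ins, not the real Syuzhet/NRC implementation:

```python
# Toy illustration of dictionary-based sentiment scoring in the spirit of
# syuzhet's get_sentiment(..., method='nrc'): each word found in a sentiment
# lexicon contributes +1 or -1, and the tweet score is the sum. The lexicon
# below is a stand-in, not the real NRC dictionary.

LEXICON = {'king': 1, 'finally': 1, 'love': 1,
           'kill': -1, 'fight': -1, 'hate': -1}

def get_sentiment(text: str) -> int:
    """Integer score: positive tweets score > 0, negative tweets < 0."""
    words = text.lower().split()
    return sum(LEXICON.get(w, 0) for w in words)

print(get_sentiment('Finally the Night King arrives'))  # 2
print(get_sentiment('Arya will kill and fight'))        # -2
```

This also makes the dictionary dependence obvious: a word missing from (or misclassified in) the lexicon silently skews the score, which is exactly the limitation discussed later for words like Stark or Kill.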

At this point I can see the score associated with every tweet, and since the R package uses dictionaries, I limited my research to tweets in the English language (filtering the column User.Lang.Clean mentioned above to en).

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

The next step is to average the sentiment by character. It seems an easy step, but the devil is in the details! Tableau treats the output of the SCRIPT_INT call to R as an aggregated metric, thus not giving any visual option to re-aggregate! Plus, the tweet Text field must be present in the layout for the sentiment to be calculated, otherwise the metric results in NULL.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

Fortunately there are table calculations, specifically window functions like WINDOW_AVG, allowing a post-aggregation based on a formula defining the start and end of the window. The other cool fact is that window functions work per partition of the data, and the start and end of the window can be defined using the FIRST() and LAST() functions.

We can now create an aggregated version of our Sentiment column with the following formula

WINDOW_AVG(FLOAT([Sentiment]), FIRST(), LAST())  

This column will be repeated with the same value for all rows within the same "partition", in this case the character Name.
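The per-partition repetition can be sketched in Python (data and variable names are illustrative):

```python
# Sketch of WINDOW_AVG(FLOAT([Sentiment]), FIRST(), LAST()) partitioned by
# character Name: every row in a partition receives that partition's average.

rows = [  # (Name, Sentiment) per tweet
    ('Sansa', -1), ('Sansa', 0), ('Sansa', -2),
    ('Whitewalkers', 1), ('Whitewalkers', 2),
]

# First pass: per-partition sum and count
totals = {}
for name, score in rows:
    s, n = totals.get(name, (0, 0))
    totals[name] = (s + score, n + 1)

# Second pass: repeat the window average on every row of its partition,
# without reducing the number of rows (just like Tableau)
window_avg = [(name, score, totals[name][0] / totals[name][1])
              for name, score in rows]
for row in window_avg:
    print(row)
```

Note that the output still has one row per tweet; only the third column is constant within each character, which is what makes the per-character graphs possible.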

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

Be aware that this solution doesn't re-aggregate the data; we'll still see the data by single tweet Text and character Name. However, the metric is calculated as a total per character, so graphs can be displayed.

I wanted to show a scatter plot based on the # of Mentions and Sentiment of each character. With the window functions defined above, it's as easy as dragging the fields into the proper place and selecting the scatter plot viz.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

The default view is not very informative since I can't really associate a character with its position in the chart until I hover over the related mark. Fortunately Tableau allows the definition of custom shapes, and I could easily assign character photos to the related names.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

If negative mentions for Littlefinger and Cersei were somehow expected, the characters with the most negative sentiment are Sansa Stark, probably due to the mysterious letter found by Arya in Baelish's room, and Ellaria Sand. On the opposite side we strangely see the Night King and, more generally, the Whitewalkers with a very positive sentiment associated with them. Strange, this needs further investigation.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

Deep Dive on Whitewalkers and Sansa

I can create a view per character with the associated tweets and sentiment scores and filter it for the Whitewalkers. It looks like there are great expectations for this character in the next episodes (the battle is coming), which are associated with positive sentiments.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

When analysing the detail of the number of tweets falling in each sentiment score category, it's clear why Sansa and the Whitewalkers have such different sentiment averages. Both appear as normal distributions, but the center of the Whitewalkers curve is around 1 (positive sentiment) while Sansa's is between -1 and 0 (negative sentiment).

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

This explanation however doesn't give me enough information, and I want to understand more about the most used words in tweets mentioning the Whitewalkers or the Night King.

Warning: the method described below is not optimal performance-wise and should be used carefully, since it causes data duplication if not handled properly.

There is no easy way to do so directly in Tableau, even using R since all the functions expect the output size to be 1-1 with the input, like sentiment score and text.
For this purpose I created a view on top of the BigQuery table directly in Tableau using the New Custom SQL option. The SQL used is the following

SELECT  ID, REPLACE(REPLACE(SPLIT(UPPER(TEXT),' '),'#',''),'@','')  word FROM [Dataset.rm_got]  

The SPLIT function divides the Text field into multiple rows, one for every word separated by a space. This is a very basic split and can of course be enhanced if needed. On top of it, the SQL removes references to # and @. Since the view contains the tweet's Id field, it can be used to join this dataset with the main table.
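The same tokenization can be sketched in Python, including the punctuation stripping that would improve on the basic SQL split (the function name is mine):

```python
import string

def split_words(tweet_id, text):
    """Emit (id, word) rows like the BigQuery SPLIT view: uppercase words
    with '#' and '@' removed. Stripping surrounding punctuation avoids
    'KING' and 'KING!' being counted as two different words, which the
    basic SQL version suffers from."""
    rows = []
    for word in text.upper().split(' '):
        word = word.replace('#', '').replace('@', '')
        word = word.strip(string.punctuation)  # enhancement over the raw SQL
        if word:
            rows.append((tweet_id, word))
    return rows

print(split_words(42, 'The Night King! #GoT @haranatom'))
```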

The graph showing all the words associated with characters is not really helpful, since the amount of words (even after keeping only those above a minimum length) is too huge to be analysed properly.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

When analysing the single words in particular tweets, I can clearly see that the Whitewalkers sentiment is driven by words like King, Iron and Throne, which have a positive sentiment. On the other hand, Sansa Stark is penalized by words like Kill and Fight, probably due to the possible trouble with Arya.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

One thing to mention is that the word Stark is classified with a negative sentiment due to the general English dictionary used for the scoring. This affects all the tweets and in particular the average scores of all the characters belonging to House Stark. A new "GoT" dictionary should be created and used in order to avoid this kind of misinterpretation.

Also, when talking about "Game of Thrones", words like Kill or Death can have a positive or negative meaning depending on the sentence. An imaginary tweet like

Finally Arya kills Cersei

should have a positive sentiment for Arya and a negative one for Cersei, but this is where automatic sentiment classification techniques show their limits. Not even a new dictionary could help in this case.

The chart below shows the percentage of words classified as positive (score 1 or 2) or negative (score -1 or -2) for the two selected characters. We can clearly see that Sansa has more negative words than positive, as expected, while the Whitewalkers are on the opposite side.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

Furthermore, the overall sentiment for the two characters may be explained by the following graph. It shows, for every sentence sentiment category (divided into Positive, Neutral and Negative bins), a histogram based on the count of words by single-word sentiment. We can clearly see how words with positive sentiment drive the Positive sentence category (and vice versa).

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

Finally, the last graph shows the words that most impacted the overall positive and negative sentiment for both characters.

How was Game Of Thrones S07 E05? Tweet Analysis with Kafka, BigQuery and Tableau

We can clearly see that Sansa's negative sentiment is due to Stark, Hate and Victim. On the other side, the Whitewalkers' positive sentiment is due to words like King (the Night King is the character) and Finally, probably due to the battle coming in the next episode. As you can see, there are also multiple instances of the word King due to different punctuation preceding or following the word. I stated above that the BigQuery SQL extracting the words via the SPLIT function was very basic; we can now see why. Little enhancements to the function would properly aggregate the words.

Are you still there? Do you wonder what's left? Well there is a whole set of analysis that can be done on top of this dataset, including checking the sentiment behaviour by time during the live event or comparing this week's dataset with the next episode's one. The latter may happen next week so... Keep in touch!

Hope you enjoyed the analysis... otherwise... Dracarys!


Categories: BI & Warehousing

Analyzing Wimbledon Twitter Feeds in Real Time with Kafka, Presto and Oracle DVD v3

Mon, 2017-07-17 09:09
Analyzing Wimbledon Twitter Feeds in Real Time with Kafka, Presto and Oracle DVD v3

Last week there was Wimbledon, if you are a fan of Federer, Nadal or Djokovic then it was one of the events not to be missed. I deliberately excluded Andy Murray from the list above since he kicked out my favourite player: Dustin Brown.

Analyzing Wimbledon Twitter Feeds in Real Time with Kafka, Presto and Oracle DVD v3

Two weeks ago I was at Kscope17 and one of the common themes, reflecting where the industry is going, was the usage of Kafka as a central hub for all data pipelines. I won't go into detail on Kafka's specific role and how it accomplishes it; you can grab the idea from two slides taken from a recent presentation by Confluent.

Analyzing Wimbledon Twitter Feeds in Real Time with Kafka, Presto and Oracle DVD v3

One of the key points of all Kafka-related discussions at Kscope was that Kafka is widely used to take data from providers and push it to specific data stores (like HDFS) that are then queried by analytical tools. However, the "parking to data-store" step can sometimes be omitted, with analytical tools querying Kafka directly for real-time analytics.

Analyzing Wimbledon Twitter Feeds in Real Time with Kafka, Presto and Oracle DVD v3

At the beginning of the year we wrote a blog post about doing this with Spark Streaming and Python; however, that setup was more data-scientist oriented and didn't provide the simple ANSI SQL familiar to the beloved end-users.

As usual, Oracle announced a new release during Kscope. This year it was Oracle Data Visualization Desktop, with a bunch of new features covered in my previous blog post.
The enhancement, amongst others, that made my day was the support for JDBC and ODBC drivers. It opened a whole bundle of opportunities to query tools not officially supported by DVD but that expose those types of connectors.

One of the tools that fits in this category is Presto, a distributed query engine belonging to the same family as Impala and Drill, commonly referred to as SQL-on-Hadoop. A big plus of this tool, compared to the other two mentioned above, is that it natively queries Kafka via a dedicated connector.

I then found a way of fitting two of the main Kscope17 topics, a new SQL-on-Hadoop tool, and one of my favourite sports (tennis) into the same blog post: analysing real-time Twitter feeds with Kafka, Presto and Oracle DVD v3. Not bad as an idea... let's check if it works...

Analysing Twitter Feeds

Let's start from the actual fun: analysing the tweets! We can navigate to the Oracle Analytics Store and download some interesting add-ins we'll use: the Auto Refresh plugin that enables the refresh of the DV project, the Heat Map and Circle Pack visualizations and the Term Frequency advanced analytics pack.

Importing the plugin and new visualizations can be done directly in the console as explained in my previous post. In order to be able to use the advanced analytics function we need to unzip the related file and move the .xml file it contains into %INSTALL_DIR%\OracleBI1\bifoundation\advanced_analytics\script_repository. The Advanced Analytics zip file also contains a .dva project that we can import into DVD (password Admin123) which gives us a hint on how to use the function.

We can now build a DVD project about the Wimbledon gentlemen's singles final containing:

  • A table view showing the latest tweets
  • A horizontal bar chart showing the number of tweets containing mentions to Federer, Cilic or Both
  • A circle view showing the most tweeted terms
  • A heatmap showing tweet locations (only for tweets with an activated localization)
  • A line chart showing the number of tweets over time

The project is automatically refreshed using the auto-refresh plugin mentioned above. A quick view of the result is provided by the following image.

Analyzing Wimbledon Twitter Feeds in Real Time with Kafka, Presto and Oracle DVD v3

So far all good and simple! Now it's time to go back and check how the data is collected and queried. Let's start from Step #1: pushing Twitter data to Kafka!


We covered Kafka installation and setup in previous blog post, so I'll not repeat this part.
The only piece I want to mention, since it gave me trouble, is the advertised.host.name setting: it's a configuration line in /opt/kafka*/config/server.properties that tells Kafka the host on which it's listening.

If you leave the default localhost and try to push content to a topic from an external machine, it will not show up, so as a prerequisite change it to a hostname/IP that can be resolved externally.

The rest of the Kafka setup is the creation of a Twitter producer. I took this Java project as an example and changed it to use the latest Kafka release available in Maven. It allowed me to create a Kafka topic named rm.wimbledon storing tweets containing the word Wimbledon.

The same output could be achieved using Kafka Connect and its source and sink for Twitter. Kafka Connect also has the benefit of being able to transform the data before landing it in Kafka, making the data parsing easier and the storage faster to retrieve. I'll cover the usage of Kafka Connect in a future post; for more information about it, check this presentation from Robin Moffatt of Confluent.

One final note about Kafka: I ran a command to limit the retention to a few minutes

bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic rm.wimbledon --config retention.ms=300000  

This limits the amount of data that is kept in Kafka, providing better performance at query time. This is not always possible in Kafka due to data collection needs, and there are other ways of optimizing the query if necessary.

At this point of our project we have a dataflow from Twitter to Kafka, but no known way of querying it with DVD. It's time to introduce the query engine: Presto!


Presto was developed at Facebook and belongs to the family of SQL-on-Hadoop tools. However, like Apache Drill, it could be called SQL-on-everything, since the data doesn't need to reside on a Hadoop system: Presto can query local file systems, MongoDB, Hive, and a wide variety of other data sources.

Like the other SQL-on-Hadoop technologies, it works with always-on daemons, which avoid the startup latency Hive incurs when launching a MapReduce job. Presto, differently from the others, divides the daemons into two types: the Coordinator and the Worker. A Coordinator is a node that receives queries from clients; it analyses and plans the execution, which is then passed on to Workers to carry out.

In other tools like Impala and Drill, every node by default can act as both coordinator and worker. The same can also happen in Presto, but it is not the default, and the documentation suggests dedicating a single machine to coordination tasks only, for best performance in large clusters (reference to the doc).

The following image, taken from the Presto website, explains the flow when the Hive metastore is used as the data source.

Analyzing Wimbledon Twitter Feeds in Real Time with Kafka, Presto and Oracle DVD v3


The default Presto installation procedure is pretty simple and can be found in the official documentation. We just need to download the presto-server-0.180.tar.gz tarball and unpack it.

tar -xvf presto-server-0.180.tar.gz  

This creates a folder named presto-server-0.180, which is the installation directory; the next step is to create a subfolder named etc containing the configuration settings.

Then we need to create four configuration files and a folder within the etc folder:

  • node.properties: configuration specific to each node, enabling the configuration of a cluster
  • jvm.config: options for the Java Virtual Machine
  • config.properties: specific coordinator/worker settings
  • log.properties: specifies log levels
  • catalog: a folder that will contain the data source definitions

For basic functionality we need the following configurations:


Here the environment parameter is shared across all the nodes in the cluster, id is a unique identifier of the node, and data-dir is the location where Presto will store logs and data.
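For reference, a node.properties along these lines might be (the node id and data directory are illustrative placeholders to adapt to your environment):

```properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
```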


I reduced the -Xmx parameter to 4GB as I'm running in a test VM. The parameters can of course be changed as needed.
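A jvm.config consistent with that setting might look like this (only -Xmx reflects the value mentioned here; the remaining flags are the defaults suggested in the Presto deployment docs, included as an assumption):

```properties
-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
```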


Since we want to keep it simple, we'll create a single node acting both as coordinator and as worker; the related config file is:


Here coordinator=true tells Presto to function as coordinator, http-server.http.port defines the HTTP port, and discovery.uri is the URI of the Discovery server (in this case the same process).
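A single-node config.properties consistent with this setup could be the following (the hostname reuses the server name that appears later in the post; the memory limits are illustrative):

```properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://linuxsrv.local.com:8080
```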


We can keep the default INFO level; other levels are DEBUG, WARN and ERROR.
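The corresponding log.properties can be a single line:

```properties
com.facebook.presto=INFO
```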


The last step in the configuration is the datasource setting: we need to create a folder named catalog within etc and create a file for each connection we intend to use.

For the purpose of this post we want to connect to the Kafka topic named rm.wimbledon. We need to create a file named kafka.properties within the catalog folder created above. The file contains the following lines


where kafka.nodes points to the Kafka brokers and kafka.table-names defines the comma-delimited list of topics to expose.
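An illustrative kafka.properties (the broker address assumes a single local broker on the default port):

```properties
connector.name=kafka
kafka.nodes=localhost:9092
kafka.table-names=rm.wimbledon
kafka.hide-internal-columns=false
```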

The last bit needed is to start the Presto server by executing

bin/launcher start  

We can append the --verbose parameter to debug the installation; logs can be found in the var/log folder.

Presto Command Line Client

In order to query Presto via the command line interface, we just need to download the associated client (see the official doc), which comes as a presto-cli-0.180-executable.jar file. We can rename the file to presto and make it executable.

mv presto-cli-0.180-executable.jar presto  
chmod +x presto  

Then we can start the client by executing

./presto --server linuxsrv.local.com:8080 --catalog kafka --schema rm

Remember that the client requires JDK 1.8, otherwise you will face an error. Once the client is successfully set up, we can start querying Kafka

You may notice that the schema (rm) we're connecting to is just the prefix of the rm.wimbledon topic used in Kafka. This way I could potentially store other topics using the same rm prefix and query them all together.

We can check which schemas can be used in Kafka with

presto:rm> show schemas;  
       Schema       
--------------------
 information_schema 
 rm                 
(2 rows)

We can also check which topics are contained in rm schema by executing

presto:rm> show tables;  
   Table   
-----------
 wimbledon 
(1 row)

or change schema by executing

use information_schema;  

Going back to the Wimbledon example we can describe the content of the topic by executing

presto:rm> describe wimbledon;  
      Column       |  Type   | Extra |                   Comment                   
 _partition_id     | bigint  |       | Partition Id                                
 _partition_offset | bigint  |       | Offset for the message within the partition 
 _segment_start    | bigint  |       | Segment start offset                        
 _segment_end      | bigint  |       | Segment end offset                          
 _segment_count    | bigint  |       | Running message count per segment           
 _key              | varchar |       | Key text                                    
 _key_corrupt      | boolean |       | Key data is corrupt                         
 _key_length       | bigint  |       | Total number of key bytes                   
 _message          | varchar |       | Message text                                
 _message_corrupt  | boolean |       | Message data is corrupt                     
 _message_length   | bigint  |       | Total number of message bytes               
(11 rows)

We can immediately start querying it like

presto:rm> select count(*) from wimbledon;  
(1 row)

Query 20170713_102300_00023_5achx, FINISHED, 1 node  
Splits: 18 total, 18 done (100.00%)  
0:00 [27 rows, 195KB] [157 rows/s, 1.11MB/s]  

Remember, all the queries run against Kafka in real time, so the more messages we push, the more results we'll have available. Let's now check what the messages look like

presto:rm> SELECT _message FROM wimbledon LIMIT 5;

 {"created_at":"Thu Jul 13 10:22:46 +0000 2017","id":885444381767081984,"id_str":"885444381767081984","text":"RT @paganrunes: Ian McKellen e Maggie Smith a Wimbl

 {"created_at":"Thu Jul 13 10:22:46 +0000 2017","id":885444381913882626,"id_str":"885444381913882626","text":"@tomasberdych spricht vor dem @Wimbledon-Halbfinal 

 {"created_at":"Thu Jul 13 10:22:47 +0000 2017","id":885444388645740548,"id_str":"885444388645740548","text":"RT @_JamieMac_: Sir Andrew Murray is NOT amused wit

 {"created_at":"Thu Jul 13 10:22:49 +0000 2017","id":885444394404503553,"id_str":"885444394404503553","text":"RT @IBM_UK_news: What does it take to be a #Wimbled

 {"created_at":"Thu Jul 13 10:22:50 +0000 2017","id":885444398929989632,"id_str":"885444398929989632","text":"RT @PakkaTollywood: Roger Federer Into Semifinals \

(5 rows)

As expected, tweets are stored in JSON format, so we can use the Presto JSON functions to extract the relevant information. In the following we're extracting the user.name part of every tweet. Note the LIMIT 10 (common among the SQL-on-Hadoop technologies) restricting the number of rows returned.

presto:rm> SELECT json_extract_scalar(_message, '$.user.name') FROM wimbledon LIMIT 10;  
 pietre --           
 BLICK Sport         
 Hugh Leonard        
 ••••Teju KaLion•••• 
 Charlie Murray      
 The Daft Duck.      
 Raj Singh Chandel   
(10 rows)

We can also create summaries like the top 10 users by number of tweets.

presto:rm> SELECT json_extract_scalar(_message, '$.user.name') as screen_name, count(json_extract_scalar(_message, '$.id')) as nr FROM wimbledon GROUP BY json_extract_scalar(_message, '$.user.name') ORDER BY count(json_extract_scalar(_message, '$.id')) desc LIMIT 10;  
     screen_name     | nr  
 Evarie Balan        | 125 
 The Master Mind     | 104 
 Oracle Betting      |  98 
 Nichole             |  85 
 The K - Man         |  75 
 Kaciekulasekran     |  73 
 vientrainera        |  72 
 Deporte Esp         |  66 
 Lucas Mc Corquodale |  64 
 Amal                |  60 
(10 rows)
Adding a Description file

We saw above that it's possible to query with ANSI SQL statements using the Presto JSON functions. The next step is to define a structure on top of the data stored in the Kafka topic, to turn the raw data into a table format. We can achieve this by writing a topic description file. The file must be in JSON format and stored under the etc/kafka folder; it is recommended, but not necessary, that the name of the file matches the Kafka topic (in our case rm.wimbledon). The file in our case would be the following:

{
    "tableName": "wimbledon",
    "schemaName": "rm",
    "topicName": "rm.wimbledon",
    "key": {
        "dataFormat": "raw",
        "fields": [
            {
                "name": "kafka_key",
                "dataFormat": "LONG",
                "type": "BIGINT",
                "hidden": "false"
            }
        ]
    },
    "message": {
        "dataFormat": "json",
        "fields": [
            {
                "name": "created_at",
                "mapping": "created_at",
                "type": "TIMESTAMP",
                "dataFormat": "rfc2822"
            },
            {
                "name": "tweet_id",
                "mapping": "id",
                "type": "BIGINT"
            },
            {
                "name": "tweet_text",
                "mapping": "text",
                "type": "VARCHAR"
            },
            {
                "name": "user_id",
                "mapping": "user/id",
                "type": "VARCHAR"
            },
            {
                "name": "user_name",
                "mapping": "user/name",
                "type": "VARCHAR"
            }
        ]
    }
}

After restarting Presto, executing the DESCRIBE operation shows all the fields available.

presto:rm> describe wimbledon;  
      Column       |   Type    | Extra |                   Comment                   
 kafka_key         | bigint    |       |                                             
 created_at        | timestamp |       |                                             
 tweet_id          | bigint    |       |                                             
 tweet_text        | varchar   |       |                                             
 user_id           | varchar   |       |                                             
 user_name         | varchar   |       |                                             
 user_screenname   | varchar   |       |                                             
 user_location     | varchar   |       |                                             
 user_followers    | bigint    |       |                                             
 user_time_zone    | varchar   |       |                                             
 _partition_id     | bigint    |       | Partition Id                                
 _partition_offset | bigint    |       | Offset for the message within the partition 
 _segment_start    | bigint    |       | Segment start offset                        
 _segment_end      | bigint    |       | Segment end offset                          
 _segment_count    | bigint    |       | Running message count per segment           
 _key              | varchar   |       | Key text                                    
 _key_corrupt      | boolean   |       | Key data is corrupt                         
 _key_length       | bigint    |       | Total number of key bytes                   
 _message          | varchar   |       | Message text                                
 _message_corrupt  | boolean   |       | Message data is corrupt                     
 _message_length   | bigint    |       | Total number of message bytes               
(21 rows)

Now I can use the newly defined columns in my query

presto:rm> select created_at, user_name, tweet_text from wimbledon LIMIT 10;  

and the related results

Analyzing Wimbledon Twitter Feeds in Real Time with Kafka, Presto and Oracle DVD v3

We can always mix the defined columns with Presto's custom JSON parsing syntax if we need to extract other fields.

select created_at, user_name, json_extract_scalar(_message, '$.user.default_profile') from wimbledon LIMIT 10;  
Oracle Data Visualization Desktop

As mentioned at the beginning of the article, the overall goal was to analyse the Wimbledon Twitter feed in real time with Oracle Data Visualization Desktop via JDBC, so let's complete the picture!

JDBC drivers

The first step is to download the Presto JDBC driver, version 0.175, which I found on the Maven website. I also tried the 0.180 version, downloadable directly from the Presto website, but I hit several errors during connection.
After downloading, we need to copy the driver presto-jdbc-0.175.jar under the %INSTALL_DIR%\lib folder, where %INSTALL_DIR% is the Oracle DVD installation folder, and start DVD. Then I just need to create a new connection like the following

Analyzing Wimbledon Twitter Feeds in Real Time with Kafka, Presto and Oracle DVD v3

Note that:

  • URL: also includes the /kafka suffix; this tells Presto which catalog I want to query
  • Driver Class Name: this setting puzzled me a little bit; I was able to discover the string (with the help of Gianni Ceresa) by concatenating the folder name and the driver class name after unpacking the jar file

Analyzing Wimbledon Twitter Feeds in Real Time with Kafka, Presto and Oracle DVD v3

  • Username/password: these strings can be anything, since for this basic test we didn't set up any security on Presto.

The whole JDBC setup process is described in this YouTube video provided by Oracle.

We can then define the source by just selecting the columns we want to import and creating a few additional ones, like Lat and Long parsed from the coordinates column, which is in the form [Lat, Long]. The dataset is now ready to be analysed, as we saw at the beginning of the article, with the final result being:

Analyzing Wimbledon Twitter Feeds in Real Time with Kafka, Presto and Oracle DVD v3


As we can see from the above picture, the whole process works (phew....). However, it has some limitations: there is no pushdown of functions to the source, so most of the queries we see against Presto are in the form of

select tweet_text, tweet_id, user_name, created_at from (  
select coordinates, tweet_text, tweet_id, user_name, created_at  
from rm.wimbledon)  

This means that the whole dataset is retrieved every time, making this solution far from optimal for big volumes of data. In those cases the "parking" to a datastore step would probably be necessary. Another limitation relates to transformations: the Lat and Long extractions from the coordinates field, along with the other column transformations, are done directly in DVD, meaning that the formula is applied at visualization time. In the second post we'll see how the source parsing phase and query performance can be enhanced using Kafka Connect, the framework allowing an easy integration between Kafka and other sources or sinks.
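For illustration only, the Lat/Long split performed in DVD boils down to logic like the following (a hypothetical Python helper, not DVD's actual expression syntax):

```python
def parse_coordinates(coordinates):
    """Split a '[Lat, Long]' string into float latitude and longitude."""
    lat, lon = coordinates.strip('[]').split(',')
    return float(lat), float(lon)

# Example: a coordinates value as it appears in the tweet JSON
print(parse_coordinates('[51.434, -0.214]'))  # -> (51.434, -0.214)
```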

One last word: winning Wimbledon eight times, fourteen years after the first victory and five years after the last one, is something impressive! Chapeau, Mr Federer!

Categories: BI & Warehousing

Streaming Global Cyber Attack Analytics with Tableau and Python

Tue, 2017-07-11 10:19
Streaming Global Cyber Attack Analytics with Tableau and Python

Streaming Global Cyber Attack Analytics with Tableau and Python

Introduction and Hacks

As grandiose a notion as the title may imply, there have been some really promising and powerful moves made in the advancement of smoothly integrating real-time and/or streaming data technologies into most any enterprise reporting and analytics architecture. When used in tandem with functional programming languages like Python, we now have the ability to create enterprise grade data engineering scripts to handle the manipulation and flow of data, large or small, for final consumption in all manner of business applications.

In this cavalcade of coding, we're going to use a combination of Satori, a free data streaming client, and Python to stream live worldwide cyber attack activity via an API. We'll consume the records as JSON, and then use a few choice Python libraries to parse, normalize, and insert the records into a MySQL database. Finally, we'll hook it all up to Tableau and watch cyber attacks happen in real time with a really cool visualization.

The Specs

For this exercise, we're going to bite things off a chunk at a time. We're going to utilize a service called Satori, a streaming data source aggregator that makes it easy to hook up to any number of streams to work with as we please. In this case, we'll be working with the Live Cyber Attack Threat Map data set. Next, we'll set up our producer code, which will do a couple of things. First it will create the API client from which we will be ingesting a constant flow of cyber attack records. Next, we'll take these records and convert them to a data frame using the Pandas library for Python. Finally, we will insert them into a MySQL database. This will allow us to use the live feed as a source for Tableau in order to create a geo mapping of countries that are currently being targeted by cyber attacks.

The Data Source

Streaming Global Cyber Attack Analytics with Tableau and Python

Satori is a new-ish service that aggregates the web's streaming data sources and provides developers with a client and some sample code that they can then use to set up their own live data streams. While your interests may lie in how you can stream your own company's data, that simply becomes a matter of using Python's requests library to get at whatever internal sources you might need. Find more on the requests library here.

Satori has taken a lot of the guesswork out of the first step of the process, providing basic code samples in a number of popular languages to access their streaming service and generate records; you can find the link to this code here. Note that you'll need to install their client and get your own app key. I've added a bit of code at the end to handle the insertion of records, and to continue the flow should any records produce a warning.

Satori Code
# Imports
from __future__ import print_function

import sys  
import threading  
from pandas import DataFrame  
from satori.rtm.client import make_client, SubscriptionMode

# Local Imports
from create_table import engine

# Satori Variables
channel = "live-cyber-attack-threat-map"  
endpoint = "wss://open-data.api.satori.com"  
appkey = " "

# Local Variables
table = 'hack_attacks'

def main():

    with make_client(
            endpoint=endpoint, appkey=appkey) as client:

        mailbox = []
        got_message_event = threading.Event()

        class SubscriptionObserver(object):
            def on_subscription_data(self, data):
                for message in data['messages']:
                    mailbox.append(message)
                got_message_event.set()

        subscription_observer = SubscriptionObserver()
        # Subscribe to the channel and collect incoming records
        client.subscribe(
            channel,
            SubscriptionMode.SIMPLE,
            subscription_observer)

        if not got_message_event.wait(30):
            print("Timeout while waiting for a message")
            sys.exit(1)

        for message in mailbox:
            # Create dataframe
            data = DataFrame([message],
                             columns=['attack_type', 'attacker_ip', 'attack_port',
                                      'latitude2', 'longitude2', 'longitude',
                                      'city_target', 'country_target', 'attack_subtype',
                                      'latitude', 'city_origin', 'country_origin'])
            # Insert records to table; log and keep going on failure
            try:
                data.to_sql(table, engine, if_exists='append')
            except Exception as e:
                print(e)
                continue

if __name__ == '__main__':  
    main()
Creating a Table

Now that we've set up the streaming code that we'll use to fill our table, we'll need to set up the table in MySQL to hold the records. For this we'll use the SQLAlchemy ORM (object relational mapper). That's a highfalutin term for a tool that simply abstracts SQL commands to be more 'pythonic'; that is, you don't necessarily have to be a SQL expert to create tables in your given database. Admittedly, it can be a bit daunting to get the hang of, but give it a shot. Many developers choose to interact with a given database either via direct SQL or using an ORM. It's good practice to use a separate Python file, in this case settings.py (or some variation thereof), to hold your database connection string, entitled SQLALCHEMY_DATABASE_URI, in the following format (the mysqldb tag at the beginning is a result of the MySQL library you'll need to install for Python):
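A minimal sketch of such a settings.py, with hypothetical placeholder credentials (the mysql+mysqldb prefix selects the MySQLdb/mysqlclient driver):

```python
# settings.py -- SQLAlchemy connection string; the credentials, host
# and database name below are illustrative placeholders, not real values.
DB_USER = 'root'
DB_PASS = 'secret'
DB_HOST = 'localhost'
DB_NAME = 'hacks'

SQLALCHEMY_DATABASE_URI = (
    'mysql+mysqldb://{user}:{password}@{host}/{db}'
    .format(user=DB_USER, password=DB_PASS, host=DB_HOST, db=DB_NAME)
)

print(SQLALCHEMY_DATABASE_URI)  # -> mysql+mysqldb://root:secret@localhost/hacks
```

create_table.py can then import this string and build the engine used by the producer with SQLAlchemy's create_engine.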


Don't forget to sign in to your database to validate success!

Feeding MySQL and Tableau

Now all we need to do is turn on the hose and watch our table fill up. After running producer.py, we can open a new tab, log in to our database to make sure the table is being populated, and go to work. Create a new connection in Tableau to your MySQL database (I called mine 'hacks') and verify that everything is in order once you navigate to the data preview. There are lots of nulls in this data set, but that will simply be a matter of filtering them out on the front end.

Streaming Global Cyber Attack Analytics with Tableau and Python

Tableau should pick up right away on the geo data in the dataset, as denoted by the little globe icon next to the field.
Streaming Global Cyber Attack Analytics with Tableau and Python

We can now simply double-click on the corresponding geo data field, in this case Country Target, and then on the Number of Records field in the Measures area.

Streaming Global Cyber Attack Analytics with Tableau and Python

I've chosen the 'Dark' map theme for this example, as it just really jives with the whole cyber attack, international espionage vibe. Note that if using Tableau Desktop you'll need to maintain a live connection to your data source via Tableau and refresh at the interval you'd like. If you're curious about how to provide this functionality automagically, a quick Google search will come up with some solutions.

Categories: BI & Warehousing