BI & Warehousing

Rittman Mead at UKOUG 2017

Rittman Mead Consulting - Mon, 2017-12-04 02:58

For those of you attending the UKOUG this year, we are giving three presentations on OBIEE and Data Visualisation.

Francesco Tisiot has two on Monday:

  • 14.25 // Enabling Self-Service Analytics With Analytic Views & Data Visualization From Cloud to Desktop - Hall 7a
  • 17:55 // OBIEE: Going Down the Rabbit Hole - Hall 7a

Federico Venturin is giving his culinary advice on Wednesday:

  • 11:25 // Visualising Data Like a Top Chef - Hall 6a

And Mike Vickers is diving into BI Publisher, also on Wednesday

  • 15:15 // BI Publisher: Teaching Old Dogs Some New Tricks - Hall 6a

In addition, Sam Jeremiah and I are also around, so if anyone wants to catch up, grab us for a coffee or a beer.

Categories: BI & Warehousing

The Biggest Change to Reporting & Analysis in 2018 Won’t Be the Cloud

Look Smarter Than You Are - Wed, 2017-11-29 12:43
Screenshot from

Companies spent most of 2017 either preparing their journey to the Cloud, getting started on moving their applications to the Cloud, or hoping the whole Cloud thing would go away if we just ignored it long enough (like my late fees at Blockbuster). But in the end, the Cloud isn’t revolutionary: the Cloud just means someone else is managing your server for you. While it’s nice that your servers are now someone else’s problem, there is an actual revolution happening in reporting & analysis and it’s a technology that’s been around for decades.
The Future of Reporting & Analysis Can Also Take Selfies
Up to this point, mobile has been an afterthought in the world of reporting & analysis: we design for a laptop first and if something ends up mobile-enabled, that’s a nice-to-have. The commonly held belief is that mobile devices (phones, tablets) are too small of a footprint to show formatted reports or intricate dashboards. That belief is correct in the same way that Microsoft Outlook is way too complex of an application to make reading emails on a mobile device practical… except that most emails in the world are now read on a mobile device. They’re just not using Outlook. We had to rethink of a smaller, faster, easier, more intuitive (sorry, Microsoft) way of consuming information to take email mobile.

Reporting & analysis will also hit that tipping point in 2018 where we ask ourselves simply “what questions do I need answered to make better business decisions faster?” and then our phones will give us exactly that without all the detail a typical report or dashboard provides. Will mobile analytics kill off desktop applications? No more than the desktop killed off paper reports. They all have their place: paper reports are good for quickly looking at a large amount of formatted information, desktops will be good for details (Excel will live on for the foreseeable future), and mobile will take its rightful place as the dominant form of information consumption.
Forget the Past and Pay Attention to the Present
The greatest thing about mobile is that everyone has their phone less than six feet from them at all times [you just glanced over at yours to see if I’m right]. But would you ever look at your phone if your screen took a month to update? Traditional reports are very backwards-looking. Your typical Income Statement, for instance, tells you how you spent the last year, it sometimes tells you about the upcoming forecast, but it rarely tells you, “am I making money at this moment?” Just like the dashboard of a car would be awfully useless if it gave you last month’s average gas tank reading – hey, I was 75% full in December! – mobile reports won’t be for looking at historically dated information. Instead, we’ll look to mobile to give us just the information we need to take physical actions now.
But Why is 2018 the Year of Mobile Analytics?
Quite simply, we didn’t have the technology to support our decisions until now. While we could take reports or dashboards and interact with them on mobile devices, we don’t want to actually perform analytics on our phones. We want the computers doing the analysis for us. While we’ve had data mining for years, it was relegated to high-priced data scientists or not-so-highly-paid analysts.

We now have artificial intelligence that can look through our data 24/7 and with no guidance from us, determine what drivers correlate with which results. Machine learning can then determine which information it delivers do we truly find useful. And so we don’t have to dig through all the results to find out what the system is trying to tell us, the mobile analytics apps in 2018 will convert complex information into natural language. It will simply tell us in plain English (or your language of choice), “I looked through all your information and here are the things you need to be aware of right now.”

While that may seem like distant promises to many people, it’s here now. At Oracle’s OpenWorld 2017 conference, there was an amazing demonstration of everything I mentioned in the last paragraph. The audience was even more amazed when told that all that functionality would be in Oracle Analytics Cloud before OpenWorld 2018. I’m sure the employees of Microsoft, Tableau, QlikView, and others are either busy working on their own technological magic or they’re busier working on their resumés.
Am I Ready for the Future?
Start finding out at EPM.BI/Survey. Each year, I conduct a global survey of Business Analytics. Last year, I asked over 250 companies how they were doing in the world of reporting, analysis, planning, and consolidation.  To participate in this year’s survey, go to EPM.BI/Survey and spend 15 minutes answering questions about your State of Business Analytics that you maybe haven’t thought of in years. In exchange for filling in the survey, you’ll be invited to a webcast on January 31, 2018, at 1PM Eastern, where you’ll learn how your BI & EPM (Business Intelligence & Enterprise Performance Management) stacks up against the rest of the world.

If you have any questions, ask them in the comments or tweet them to me @ERoske.
Categories: BI & Warehousing

Garbled display while running FMW installer on Linux

Amardeep Sidhu - Sat, 2017-11-18 04:56

A colleague faced this while running FMW installer on a Linux machine. The display appeared like this








This thread gave a clue that it could have something to do with fonts. So I checked what all fonts related stuff was installed.

[root@someserver ~]# rpm -aq |grep -i font
[root@someserver ~]#

stix-fonts looked suspicious to me. So I removed that with rpm -e stix-fonts.

That actually fixed the issue. After this the Installer window was displaying fine.


Categories: BI & Warehousing fails with CRS-2101:The OLR was formatted using version 3

Amardeep Sidhu - Sat, 2017-11-18 04:33

Got this while trying to install RAC on Redhat Linux 7.2. fails with a message like

ohasd failed to start
Failed to start the Clusterware. Last 20 lines of the alert log follow:
2017-11-09 15:43:37.883:
[client(37246)]CRS-2101:The OLR was formatted using version 3.

This is bug 18370031. Need to apply the patch before running

Categories: BI & Warehousing

Taking KSQL for a Spin Using Real-time Device Data

Rittman Mead Consulting - Tue, 2017-11-07 06:41
Taking KSQL for a Spin Using Real-time Device Data Taking KSQL for a Spin Using Real-time Device Data

Evaluating KSQL has been high on my to-do list ever since it was released back in August. I wanted to experiment with it using an interesting, high velocity, real-time data stream that would allow me to analyse events at the millisecond level, rather than seconds or minutes. Finding such a data source, that is free of charge and not the de facto twitter stream, is tricky. So, after some pondering, I decided that I'd use my Thrustmaster T300RS Steering Wheel/Pedal Set gaming device as a data source,

Taking KSQL for a Spin Using Real-time Device Data

The idea being that the data would be fed into Kafka, processed in real-time using KSQL and visualised in Grafana.

This is the end to end pipeline that I created...

Taking KSQL for a Spin Using Real-time Device Data

...and this is the resulting real-time dashboard running alongside a driving game and a log of the messages being sent by the device.

This article will explain how the above real-time dashboard was built using only KSQL...and a custom Kafka producer.

I'd like to point out, that although the device I'm using for testing is unconventional, when considered in the wider context of IoT's, autonomous driving, smart automotives or any device for that matter, it will be clear to see that the low latency, high throughput of Apache Kafka, coupled with Confluent's KSQL, can be a powerful combination.

I'd also like to point out, that this article is not about driving techniques, driving games or telemetry analysis. However, seeing as the data source I'm using is intrinsically tied to those subjects, the concepts will be discussed to add context. I hope you like motorsports!

Writing a Kafka Producer for a T300RS

The T300RS is attached to my Windows PC via a USB cable, so the first challenge was to try and figure out how I could get steering, braking and accelerator inputs pushed to Kafka. Unsurprisingly, a source connector for a "T300RS Steering Wheel and Pedal Set" was not listed on the Kafka Connect web page - a custom producer was the only option.

To access the data being generated by the T300RS, I had 2 options, I could either use an existing Telemetry API from one of my racing games, or I could access it directly using the Windows DirectX API. I didn't want to have to have a game running in the background in order to generate data, so I decided to go down the DirectX route. This way, the data is raw and available, with or without an actual game engine running.

The producer was written using the SharpDX .NET wrapper and Confluent's .NET Kafka Client. The SharpDX directinput API allows you to poll an attached input device (mouse, keyboard, game controllers etc.) and read its buffered data. The buffered data returned within each polling loop is serialized into JSON and sent to Kafka using the .NET Kafka Client library.

A single message is sent to a topic in Kafka called raw_axis_inputs every time the state of one the device's axes changes. The device has several axes, in this article I am only interested in the Wheel, Accelerator, Brake and the X button.

    "event_id":4300415,         // Event ID unique over all axis state changes
    "timestamp":1508607521324,  // The time of the event
    "axis":"Y",                 // The axis this event belongs to
    "value":32873.0             // the current value of the axis

This is what a single message looks like. In the above message the Brake axis state was changed, i.e. it moved to a new position with value 32873.

You can see below which inputs map to the each reported axis from the device.

Taking KSQL for a Spin Using Real-time Device Data

Here is a sample from the producer's log file.


You can tell by looking at the timestamps, it's possible to have multiple events generated within the same millisecond, I was unable to get microsecond precision from the device unfortunately. When axes, "X", "Y" and "RotationZ" are being moved quickly at the same time (a bit like a child driving one of those coin operated car rides you find at the seaside) the device generates approximately 500 events per second.

Creating a Source Stream

Now that we have data streaming to Kafka from the device, it's time to fire up KSQL and start analysing it. The first thing we need to do is create a source stream. The saying "Every River Starts with a Single Drop" is quite fitting here, especially in the context of stream processing. The raw_axis_inputs topic is our "Single Drop" and we need to create a KSQL stream based on top of it.

CREATE STREAM raw_axis_inputs ( \  
     event_id BIGINT, \
     timestamp BIGINT, \
     axis VARCHAR, \
     value DOUBLE ) \
 WITH (kafka_topic = 'raw_axis_inputs', value_format = 'JSON');

With the stream created we can we can now query it. I'm using the default auto.offset.reset = latest as I have the luxury of being able to blip the accelerator whenever I want to generate new data, a satisfying feeling indeed.

ksql> SELECT * FROM raw_axis_inputs;  
1508693510267 | null | 4480290 | 1508693510263 | RotationZ | 65278.0  
1508693510269 | null | 4480291 | 1508693510263 | RotationZ | 64893.0  
1508693510271 | null | 4480292 | 1508693510263 | RotationZ | 63993.0  
1508693510273 | null | 4480293 | 1508693510263 | RotationZ | 63094.0  
1508693510275 | null | 4480294 | 1508693510279 | RotationZ | 61873.0  
1508693510277 | null | 4480295 | 1508693510279 | RotationZ | 60716.0  
1508693510279 | null | 4480296 | 1508693510279 | RotationZ | 60267.0  
Derived Streams

We now have our source stream created and can start creating some derived streams from it. The first derived stream we are going to create filters out 1 event. When the X button is pressed it emits a value of 128, when it's released it emits a value of 0.

Taking KSQL for a Spin Using Real-time Device Data

To simplify this input, I'm filtering out the release event. We'll see what the X button is used for later in the article.

CREATE STREAM axis_inputs WITH (kafka_topic = 'axis_inputs') AS \  
SELECT  event_id, \  
        timestamp, \
        axis, \
        value \
FROM    raw_axis_inputs \  
WHERE   axis != 'Buttons5' OR value != 0.0;  

From this stream we are going to create 3 further streams, one for the brake, one the accelerator and one for the wheel.

All 3 axes emit values in the range of 0-65535 across their full range. The wheel emits a value of 0 when rotated fully left, a value of 65535 when rotated fully right and 32767 when dead centre. The wheel itself is configured to rotate 900 degrees lock-to-lock, so it would be nice to report its last state change in degrees, rather than from a predetermined integer range. For this we can create a new stream, that includes only messages where the axis = 'X', and the axis values are translated into the range of -450 degrees to 450 degrees. With this new value translation, maximum rotation left now equates to 450 degrees and maximum rotation right equates -450 degrees, 0 is now dead centre.

CREATE STREAM steering_inputs WITH (kafka_topic = 'steering_inputs') AS \  
  SELECT  axis, \
          event_id, \
          timestamp, \
          (value / (65535.0 / 900.0) - 900 / 2) * -1 as value \
  FROM    axis_inputs \
  WHERE   axis = 'X';

If we now query our new stream and move the wheel slowly around dead centre, we get the following results

ksql> select timestamp, value from steering_inputs;

1508711287451 | 0.6388888888889142  
1508711287451 | 0.4305555555555429  
1508711287451 | 0.36111111111108585  
1508711287451 | 0.13888888888891415  
1508711287451 | -0.0  
1508711287467 | -0.041666666666685614  
1508711287467 | -0.26388888888891415  
1508711287467 | -0.3333333333333144  
1508711287467 | -0.5277777777777715  
1508711287467 | -0.5972222222222285  

The same query while the wheel is rotated fully left

1508748345943 | 449.17601281757845  
1508748345943 | 449.3270771343557  
1508748345943 | 449.5330739299611  
1508748345943 | 449.67040512703136  
1508748345959 | 449.8214694438087  
1508748345959 | 449.95880064087896  
1508748345959 | 450.0  

And finally, rotated fully right.

1508748312803 | -449.3408102540627  
1508748312803 | -449.4369420920119  
1508748312818 | -449.67040512703136  
1508748312818 | -449.7390707255665  
1508748312818 | -449.9725337605859  
1508748312818 | -450.0  

Here's the data plotted in Grafana.

Taking KSQL for a Spin Using Real-time Device Data

We now need to create 2 more derived streams to handle the accelerator and the brake pedals. This time, we want to translate the values to the range 0-100. When a pedal is fully depressed it should report a value of 100 and when fully released, a value of 0.

CREATE STREAM accelerator_inputs WITH (kafka_topic = 'accelerator_inputs') AS \  
SELECT  axis, \  
        event_id, \
        timestamp, \
        100 - (value / (65535.0 / 100.0)) as value \
FROM    axis_inputs \  
WHERE   axis = 'RotationZ';  

Querying the accelerator_inputs stream while fully depressing the accelerator pedal displays the following. (I've omitted many records in the middle to keep it short)

ksql> SELECT timestamp, value FROM accelerator_inputs;  
1508749747115 | 0.0  
1508749747162 | 0.14198473282442592  
1508749747193 | 0.24122137404580712  
1508749747209 | 0.43664122137404604  
1508749747225 | 0.5343511450381726  
1508749747287 | 0.6335877862595396  
1508749747318 | 0.7312977099236662  
1508749747318 | 0.8290076335877927  
1508749747334 | 0.9267175572519051  
1508749747381 | 1.0259541984732863  
1508749753943 | 98.92519083969465  
1508749753959 | 99.02290076335878  
1508749753959 | 99.1206106870229  
1508749753959 | 99.21832061068702  
1508749753975 | 99.31603053435114  
1508749753975 | 99.41374045801527  
1508749753975 | 99.5114503816794  
1508749753990 | 99.60916030534351  
1508749753990 | 99.70687022900763  
1508749753990 | 99.80458015267176  
1508749754006 | 100.0

...and displayed in Grafana

Taking KSQL for a Spin Using Real-time Device Data

Finally, we create the brake stream, which has the same value translation as the accelerator stream, so I won't show the query results this time around.

CREATE STREAM brake_inputs WITH (kafka_topic = 'brake_inputs') AS \  
SELECT  axis, \  
        event_id, \
        timestamp, \
        100 - (value / (65535 / 100)) as value \
FROM    axis_inputs \  
WHERE   axis = 'Y';  

Braking inputs in Grafana.

Taking KSQL for a Spin Using Real-time Device Data

Smooth is Fast

It is a general rule of thumb in motorsports that "Smooth is Fast", the theory being that the less steering, accelerator and braking inputs you can make while still keeping the car on the desired racing line, results in a faster lap time. We can use KSQL to count the number of inputs for each axis over a Hopping Window to try and capture overall smoothness. To do this, we create our first KSQL table.

CREATE TABLE axis_events_hopping_5s_1s \  
WITH (kafka_topic = 'axis_events_hopping_5s_1s') AS \  
SELECT  axis, \  
        COUNT(*) AS event_count \
FROM    axis_inputs \  
GROUP BY axis;  

A KSQL table is basically a view over an existing stream or another table. When a table is created from a stream, it needs to contain an aggregate function and group by clause. It's these aggregates that make a table stateful, with the underpinning stream updating the table's current view in the background. If you create a table based on another table you do not need to specify an aggregate function or group by clause.

The table we created above specifies that data is aggregated over a Hopping Window. The size of the window is 5 seconds and it will advance or hop every 1 second. This means that at any one time, there will be 5 open windows, with new data being directed to each window based on the key and the record's timestamp.

You can see below when we query the stream, that we have 5 open windows per axis, with each window 1 second apart.

ksql> SELECT * FROM axis_events_hopping_5s_1s;  
1508758267000 | X : Window{start=1508758267000 end=-} | X | 56  
1508758268000 | X : Window{start=1508758268000 end=-} | X | 56  
1508758269000 | X : Window{start=1508758269000 end=-} | X | 56  
1508758270000 | X : Window{start=1508758270000 end=-} | X | 56  
1508758271000 | X : Window{start=1508758271000 end=-} | X | 43  
1508758267000 | Y : Window{start=1508758267000 end=-} | Y | 25  
1508758268000 | Y : Window{start=1508758268000 end=-} | Y | 25  
1508758269000 | Y : Window{start=1508758269000 end=-} | Y | 25  
1508758270000 | Y : Window{start=1508758270000 end=-} | Y | 32  
1508758271000 | Y : Window{start=1508758271000 end=-} | Y | 32  
1508758267000 | RotationZ : Window{start=1508758267000 end=-} | RotationZ | 67  
1508758268000 | RotationZ : Window{start=1508758268000 end=-} | RotationZ | 67  
1508758269000 | RotationZ : Window{start=1508758269000 end=-} | RotationZ | 67  
1508758270000 | RotationZ : Window{start=1508758270000 end=-} | RotationZ | 67  
1508758271000 | RotationZ : Window{start=1508758271000 end=-} | RotationZ | 39  

This data is going to be pushed into InfluxDB and therefore needs a timestamp column. We can create a new table for this, that includes all columns from our current table, plus the rowtime.

CREATE TABLE axis_events_hopping_5s_1s_ts \  
WITH (kafka_topic = 'axis_events_hopping_5s_1s_ts') AS \  
SELECT  rowtime AS timestamp, * \  
FROM    axis_events_hopping_5s_1s;  

And now, when we query this table we can see we have all the columns we need.

ksql> select timestamp, axis, event_count from axis_events_hopping_5s_1s_ts;  
1508761027000 | RotationZ | 61  
1508761028000 | RotationZ | 61  
1508761029000 | RotationZ | 61  
1508761030000 | RotationZ | 61  
1508761031000 | RotationZ | 61  
1508761028000 | Y | 47  
1508761029000 | Y | 47  
1508761030000 | Y | 47  
1508761031000 | Y | 47  
1508761032000 | Y | 47  
1508761029000 | X | 106  
1508761030000 | X | 106  
1508761031000 | X | 106  
1508761032000 | X | 106  
1508761033000 | X | 106  

This is the resulting graph in Grafana with each axis stacked on top of each other giving a visual representation of the total number of events overall and total per axis. The idea here being that if you can drive a lap with less overall inputs or events then the lap time should be faster.

Taking KSQL for a Spin Using Real-time Device Data

Calculating Lap Times

To calculate lap times, I needed a way of capturing the time difference between 2 separate events in a stream. Remember that the raw data is coming directly from the device and has no concept of lap, lap data is handled by a game engine.
I needed a way to inject an event into the stream when I crossed the start/finish line of any given race track. To achieve this, I modified the custom producer to increment a counter every time the X button was pressed and added a new field to the JSON message called lap_number.

Taking KSQL for a Spin Using Real-time Device Data

I then needed to recreate my source stream and my initial derived stream to include this new field

New source stream

CREATE STREAM raw_axis_inputs ( \  
     event_id BIGINT, \
     timestamp BIGINT, \
     lap_number BIGINT, \
     axis VARCHAR, \
     value DOUBLE ) \
 WITH (kafka_topic = 'raw_axis_inputs', value_format = 'JSON');

New derived stream.

CREATE STREAM axis_inputs WITH (kafka_topic = 'axis_inputs') AS \  
SELECT  event_id, \  
        timestamp, \
        lap_number, \
        axis, \
        value \
FROM    raw_axis_inputs \  
WHERE   axis != 'Buttons5' OR value != 0.0;  

Now when I query the axis_inputs stream and press the X button a few times we can see an incrementing lap number.

ksql> SELECT timestamp, lap_number, axis, value FROM axis_inputs;  
1508762511506 | 6 | X | 32906.0  
1508762511553 | 6 | X | 32907.0  
1508762511803 | 6 | X | 32909.0  
1508762512662 | 7 | Buttons5 | 128.0  
1508762513178 | 7 | X | 32911.0  
1508762513256 | 7 | X | 32913.0  
1508762513318 | 7 | X | 32914.0  
1508762513381 | 7 | X | 32916.0  
1508762513459 | 7 | X | 32918.0  
1508762513693 | 7 | X | 32919.0  
1508762514584 | 8 | Buttons5 | 128.0  
1508762515021 | 8 | X | 32921.0  
1508762515100 | 8 | X | 32923.0  
1508762515209 | 8 | X | 32925.0  
1508762515318 | 8 | X | 32926.0  
1508762515678 | 8 | X | 32928.0  
1508762516756 | 8 | X | 32926.0  
1508762517709 | 9 | Buttons5 | 128.0  
1508762517756 | 9 | X | 32925.0  
1508762520381 | 9 | X | 32923.0  
1508762520709 | 9 | X | 32921.0  
1508762520881 | 10 | Buttons5 | 128.0  
1508762521396 | 10 | X | 32919.0  
1508762521568 | 10 | X | 32918.0  
1508762521693 | 10 | X | 32916.0  
1508762521803 | 10 | X | 32914.0  

The next step is to calculate the time difference between each "Buttons5" event (the X button). This required 2 new tables. The first table below captures the latest values using the MAX() function from the axis_inputs stream where the axis = 'Buttons5'

CREATE TABLE lap_marker_data WITH (kafka_topic = 'lap_marker_data') AS \  
SELECT  axis, \  
        MAX(event_id) AS lap_start_event_id, \
        MAX(timestamp) AS lap_start_timestamp, \ 
        MAX(lap_number) AS lap_number \
FROM    axis_inputs \  
WHERE   axis = 'Buttons5' \  
GROUP BY axis;  

When we query this table, a new row is displayed every time the X button is pressed, reflecting the latest values from the stream.

ksql> SELECT axis, lap_start_event_id, lap_start_timestamp, lap_number FROM lap_marker_data;  
Buttons5 | 4692691 | 1508763302396 | 15  
Buttons5 | 4693352 | 1508763306271 | 16  
Buttons5 | 4693819 | 1508763310037 | 17  
Buttons5 | 4693825 | 1508763313865 | 18  
Buttons5 | 4694397 | 1508763317209 | 19  

What we can now do is join this table to a new stream.

CREATE STREAM lap_stats WITH (kafka_topic = 'lap_stats') AS \  
SELECT  l.lap_number as lap_number, \  
        l.lap_start_event_id, \
        l.lap_start_timestamp, \
        a.timestamp AS lap_end_timestamp, \
        (a.event_id - l.lap_start_event_id) AS lap_events, \
        (a.timestamp - l.lap_start_timestamp) AS laptime_ms \
FROM       axis_inputs a LEFT JOIN lap_marker_data l ON a.axis = l.axis \  
WHERE   a.axis = 'Buttons5';    

Stream created

ksql> describe lap_stats;

 Field               | Type
 ROWTIME             | BIGINT
 ROWKEY              | VARCHAR (STRING)

This new stream is again based on the axis_inputs stream where the axis = 'Buttons5'. We are joining it to our lap_marker_data table which results in a stream where every row includes the current and previous values at the point in time when the X button was pressed.

A quick query should illustrate this (I've manually added column heading to make it easier to read)

ksql> SELECT lap_number, lap_start_event_id, lap_start_timestamp, lap_end_timestamp, lap_events, laptime_ms FROM lap_stats;

36 | 4708512 | 1508764549240 | 1508764553912 | 340   | 4672  
37 | 4708852 | 1508764553912 | 1508764567521 | 1262  | 13609  
38 | 4710114 | 1508764567521 | 1508764572162 | 1174  | 4641  
39 | 4711288 | 1508764572162 | 1508764577865 | 1459  | 5703  
40 | 4712747 | 1508764577865 | 1508764583725 | 939   | 5860  
41 | 4713686 | 1508764583725 | 1508764593475 | 2192  | 9750  
42 | 4715878 | 1508764593475 | 1508764602318 | 1928  | 8843

We can now see the time difference, in milliseconds ( LAP_TIME_MS ), between each press of the X button. This data can now be displayed in Grafana.

Taking KSQL for a Spin Using Real-time Device Data

The data is also being displayed along the top of the dashboard, aligned above the other graphs, as a ticker to help visualize lap boundaries across all axes.

Taking KSQL for a Spin Using Real-time Device Data

Anomaly Detection

A common use case when performing real-time stream analytics is Anomaly Detection, the act of detecting unexpected events, or outliers, in a stream of incoming data. Let's see what we can do with KSQL in this regard.

Driving Like a Lunatic?

As mentioned previously, Smooth is Fast, so it would be nice to be able to detect some form of erratic driving. When a car oversteers, the rear end of the car starts to rotate around a corner faster than you'd like, to counteract this motion, quick steering inputs are required to correct it. On a smooth lap you will only need a small part of the total range of the steering wheel to safely navigate all corners, when you start oversteering you will need make quick, but wider use of the total range of the wheel to keep the car on the track and prevent crashing.

To try and detect oversteer we need to create another KSQL table, this time based on the steering_inputs stream. This table counts steering events across a very short hopping window. Events are counted only if the rotation exceeds 180 degrees (sharp left rotation) or is less than -180 degrees (sharp right rotation)

CREATE TABLE oversteer WITH (kafka_topic = 'oversteer') AS \  
SELECT  axis, \  
        COUNT(*) \
FROM    steering_inputs \  
WHERE   value > 180 or value < -180 \  
GROUP by axis;  

We now create another table that includes the timestamp for InfluxDB.

CREATE TABLE oversteer_ts WITH (kafka_topic = 'oversteer_ts') AS \  
SELECT rowtime AS timestamp, * \  
FROM oversteer;  

If we query this table, while quickly rotating the wheel in the range value > 180 or value < -180, we can see multiple windows, 10ms apart, with a corresponding count of events.

ksql> SELECT * FROM oversteer_ts;  
1508767479920 | X : Window{start=1508767479920 end=-} | 1508767479920 | X | 5  
1508767479930 | X : Window{start=1508767479930 end=-} | 1508767479930 | X | 10  
1508767479940 | X : Window{start=1508767479940 end=-} | 1508767479940 | X | 15  
1508767479950 | X : Window{start=1508767479950 end=-} | 1508767479950 | X | 20  
1508767479960 | X : Window{start=1508767479960 end=-} | 1508767479960 | X | 25  
1508767479970 | X : Window{start=1508767479970 end=-} | 1508767479970 | X | 30  
1508767479980 | X : Window{start=1508767479980 end=-} | 1508767479980 | X | 35  
1508767479990 | X : Window{start=1508767479990 end=-} | 1508767479990 | X | 40  
1508767480000 | X : Window{start=1508767480000 end=-} | 1508767480000 | X | 45  
1508767480010 | X : Window{start=1508767480010 end=-} | 1508767480010 | X | 50  
1508767480020 | X : Window{start=1508767480020 end=-} | 1508767480020 | X | 50  
1508767480030 | X : Window{start=1508767480030 end=-} | 1508767480030 | X | 50  
1508767480040 | X : Window{start=1508767480040 end=-} | 1508767480040 | X | 50  
1508767480050 | X : Window{start=1508767480050 end=-} | 1508767480050 | X | 50  
1508767480060 | X : Window{start=1508767480060 end=-} | 1508767480060 | X | 47  
1508767480070 | X : Window{start=1508767480070 end=-} | 1508767480070 | X | 47  
1508767480080 | X : Window{start=1508767480080 end=-} | 1508767480080 | X | 47  
1508767480090 | X : Window{start=1508767480090 end=-} | 1508767480090 | X | 47  
1508767480100 | X : Window{start=1508767480100 end=-} | 1508767480100 | X | 47  

This data is plotted on the Y axis (we're talking graphs now) on the "Steering inputs" panel in Grafana. The oversteer metric can be seen in red and will spike when steering input exceeds 180 degrees in either direction.

Taking KSQL for a Spin Using Real-time Device Data

Braking too Hard?

Another anomaly I'd like to detect is when maximum brake pressure is applied for too long. Much like the brake pedal in a real car, the brake pedal I'm using has a very progressive feel, a fair amount of force from your foot is required to hit maximum pressure. If you do hit maximum pressure, it shouldn't be for long as you will most likely lock the wheels and skid off the race track, very embarrassing indeed.

The first thing to do is to create a table that will store the last time maximum brake pressure was applied. This table is based on the brake_inputs stream and filters where the value = 100

CREATE TABLE max_brake_power_time \  
WITH (kafka_topic = 'max_brake_power_time') AS \  
SELECT  axis, \  
        MAX(timestamp) as last_max_brake_ts \
FROM    brake_inputs \  
WHERE     value = 100 \  
GROUP by axis;  

A query of this table displays a new row each time maximum brake pressure is hit.

ksql> SELECT axis, last_max_brake_ts FROM max_brake_power_time;  
 Y | 1508769263100
 Y | 1508769267881
 Y | 1508769271568

Something worth mentioning is that if I hold my foot on the brake pedal at the maximum pressure for any period of time, only one event is found in the stream. This is because the device only streams data when the state of an axis changes. If I keep my foot still, no new events will appear in the stream. I'll deal with this in a minute.

Next we'll create a new stream based on the brake_inputs stream and join it to our max_brake_power_time table.

CREATE STREAM brake_inputs_with_max_brake_power_time \  
WITH ( kafka_topic = 'brake_inputs_with_max_brake_power_time') AS \  
SELECT  bi.value, \  
        bi.timestamp, \
        mb.last_max_brake_ts, \
        bi.timestamp - mb.last_max_brake_ts AS time_since_max_brake_released \
FROM    brake_inputs bi LEFT JOIN max_brake_power_time mb ON bi.axis = mb.axis;  

For each row in this stream we now have access to all columns in the brake_inputs stream plus a timestamp telling us when max brake power was last reached. With this data we create a new derived column bi.timestamp - mb.last_max_brake_ts AS time_since_max_brake_released which gives a running calculation of the difference between the current record timestamp and the last time maximum brake pressure was applied

For example, when we query the stream we can see that maximum pressure was applied at timestamp 1508772739115 with a value of 100.0. It's the row immediately after this row that we're are interested in 99.90234225 | 1508772740803 | 1508772739115 | 1688.

Again, I've manually added column headings to make it easier to read.

ksql> SELECT value, timestamp, last_max_brake_ts, time_since_max_brake_released FROM brake_inputs_with_max_brake_power_time;

98.53513389 | 1508772739100 | 1508772733146       | 5954  
98.82810711 | 1508772739100 | 1508772733146       | 5954  
99.02342259 | 1508772739115 | 1508772733146       | 5969  
99.51171129 | 1508772739115 | 1508772733146       | 5969  
99.70702677 | 1508772739115 | 1508772733146       | 5969  
100.0       | 1508772739115 | 1508772733146       | 5969  
99.90234225 | 1508772740803 | 1508772739115       | 1688  
99.51171129 | 1508772740818 | 1508772739115       | 1703  
99.12108033 | 1508772740818 | 1508772739115       | 1703  
97.65621423 | 1508772740818 | 1508772739115       | 1703  
96.58197909 | 1508772740818 | 1508772739115       | 1703  
95.41008621 | 1508772740818 | 1508772739115       | 1703  
94.43350881 | 1508772740818 | 1508772739115       | 1703  
93.65224689 | 1508772740818 | 1508772739115       | 1703  
93.35927367 | 1508772740818 | 1508772739115       | 1703  
92.87098496 | 1508772740834 | 1508772739115       | 1719  
92.38269626 | 1508772740834 | 1508772739115       | 1719  
91.11314564 | 1508772740834 | 1508772739115       | 1719  
90.62485694 | 1508772740834 | 1508772739115       | 1719  
90.42954146 | 1508772740834 | 1508772739115       | 1719  
89.35530632 | 1508772740834 | 1508772739115       | 1719  
87.89044022 | 1508772740834 | 1508772739115       | 1719  
87.40215152 | 1508772740850 | 1508772739115       | 1735  
86.52323186 | 1508772740850 | 1508772739115       | 1735  

Remember, that while an axis is held at the same value, 100.0 in this case, no more events will appear in the stream until the value changes again. This is why we are interested in the row preceding the maximum value, this row is telling us how long the value of 100.0 was applied for. In this case the time it was held for was 1688 milliseconds. Notice that on subsequent rows the value increases, but we are not interested in those rows. In order to isolate what we want, we need another table. This new table takes our previously created stream, brake_inputs_with_max_brake_power_time and groups it by the last_max_brake_ts column. For each grouping we then get the MIN(time_since_max_brake_released).

CREATE TABLE hard_braking WITH ( kafka_topic = 'hard_braking') AS \  
SELECT  last_max_brake_ts, \  
        MIN(time_since_max_brake_released) AS time_spent_at_max_brake_ms \
FROM    brake_inputs_with_max_brake_power_time \  
GROUP BY last_max_brake_ts;  

When we query this table, while stepping hard on the brake pedal for a few seconds at a time, we get the information we want. We can see the timestamp for when maximum brake pressure reached and for how long it was sustained.

ksql> SELECT last_max_brake_ts, time_spent_at_max_brake_ms FROM hard_braking;  
1508775178693 | 1360  
1508775178693 | 1360  
1508775183334 | 1000  
1508775183334 | 1000  
1508775187709 | 422  
1508775187709 | 422  
1508775187709 | 422  
1508775187709 | 422  
1508775187709 | 422  
1508775187709 | 422  
1508775187709 | 422  
1508775191256 | 1344  
1508775191256 | 1344  
1508775191256 | 1344  
1508775195850 | 1687  
1508775195850 | 1687  
1508775195850 | 1687  
1508775200662 | 1922  
1508775200662 | 1922  
1508775200662 | 1922  
1508775200662 | 1922  

Here's what the above data looks like when visualised in Grafana. The bottom graph is showing when maximum brake pressure was hit and on for how long it was sustained. I've set a threshold against the graph of 1 second so any extreme braking is clearly identifiable - if you're that hard on the brakes for that long, you're probably going to end up in the scenery.

Taking KSQL for a Spin Using Real-time Device Data

The Tale of 2 Laps

After putting it all together, it's time to take to the track and see how it looks. This video shows 2 complete laps onboard with the Caterham Seven 620R around Brands Hatch in the UK. The first lap is a relatively smooth one and the second is quite ragged. Notice that the first lap ( lap 68 ) is quicker overall than the second ( lap 69 ). On lap 69, I start to drive more aggressively and oversteer spikes start to appear in the steering input graph. Lap 69 also has significantly more events overall than lap 68 as a result my more exuberant ( slower ) driving style. You'll also notice that maximum brake pressure is reached a couple of times on each lap, but for no longer than the threshold of 1 second on each occurrence.


KSQL is awesome! Although it's only a developer preview at this point, it's impressive what you can get done with it. As it evolves over time and mirrors more of the functionality of the underlying Streams API it will become even more powerful, lowering the barrier to entry for real-time stream processing further and further. Take a look at the road map to see what may be coming next.

Oh, and I recently discovered on the #KSQL community Slack group, that you can execute KSQL in Embedded Mode right inside your Java code, allowing you to mix the native Streams API with KSQL - very nice indeed !

Categories: BI & Warehousing

Presenting at Cloud day event of North India Chapter of AIOUG

Amardeep Sidhu - Mon, 2017-11-06 05:47

I will be presenting a session titled “An 18 pointers guide to setting up an Exadata machine” at Cloud Day being organized by North India chapter of AIOUG. Vivek Sharma is doing multiple sessions on various cloud and performance related topics. You can register for the event here 


Categories: BI & Warehousing

ksplice kernel updates and Exadata patching

Amardeep Sidhu - Sun, 2017-11-05 11:32

If you have installed some one off ksplice fix for kernel on Exadata, remember to uninstall it before you do a kernel upgrade  eg regular Exadata patching. As such fixes are kernel version specific so they may not work with the newer version of the kernel. 

Categories: BI & Warehousing

ORA-15040 ORA-15042 with EXTERNAL redundancy Diskgroup

Amardeep Sidhu - Fri, 2017-11-03 12:57

A colleague was working on an ASM issue (Standalone one, Version on AIX) at one of the customer sites. Later on, I also joined him. The issue was that the customer added few news disks to an existing diskgroup. Everything went well and the rebalance kicked in. After some time, something happened and all of a sudden the diskgroup was dismounted. While trying the mount the diskgroup again, it was giving

ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "27" is missing from group number "2"

Here is the relevant text from the ASM alert log

ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 19: <strong>No such device</strong>
Additional information: -1
Additional information: 1048576
WARNING: <strong>Write Failed</strong>. group:2 disk:27 AU:1005 offset:0 size:1048576
Fri Nov 03 10:55:27 2017
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_dbw0_58983380.trc:
ORA-27063: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 19: No such device
Additional information: -1
Additional information: 4096
WARNING: Write Failed. group:2 disk:27 AU:0 offset:16384 size:4096
NOTE: cache initiating offline of disk 27 group DATADG
NOTE: process _dbw0_+asm1 (58983380) initiating offline of disk 27.3928481273 (DISK_01) with mask 0x7e in group 2
Fri Nov 03 10:55:27 2017
WARNING: Disk 27 (DISK_01) in group 2 mode 0x7f is now being offlined
WARNING: Disk 27 (DISK_01) in group 2 in mode 0x7f is now being taken offline on ASM inst 1
NOTE: initiating PST update: grp = 2, dsk = 27/0xea27ddf9, mask = 0x6a, op = clear
ERROR: failed to copy file +DATADG.263, extent 1952
GMON updating disk modes for group 2 at 36 for pid 9, osid 58983380
ERROR: Disk 27 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 2)
ERROR: ORA-15080 thrown in ARB0 for group number 2
Errors in file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_57672234.trc:
ORA-15080: synchronous I/O operation to a disk failed
Fri Nov 03 10:55:27 2017
NOTE: stopping process ARB0
WARNING: Disk 27 (DISK_01) in group 2 mode 0x7f offline is being aborted
WARNING: Offline of disk 27 (DISK_01) in group 2 and mode 0x7f failed on ASM inst 1
NOTE: halting all I/Os to diskgroup 2 (DATADG)
Fri Nov 03 10:55:28 2017
NOTE: cache dismounting (not clean) group 2/0xDEB72D47 (DATADG)
NOTE: messaging CKPT to quiesce pins Unix process pid: 62128816, image: (B000)
NOTE: dbwr not being msg'd to dismount
Fri Nov 03 10:55:28 2017
NOTE: LGWR doing non-clean dismount of group 2 (DATADG)
NOTE: LGWR sync ABA=124.7138 last written ABA 124.7138
NOTE: cache dismounted group 2/0xDEB72D47 (DATADG)
SQL> alter diskgroup DATADG dismount force /* ASM SERVER */ 

At this stage disk 27 was not readable even with dd. So that means something is wrong with the disk. Since it is an external redundancy diskgroup not much can be done until the disk becomes available.

Speaking to the storage team cleared the air. One that the disk had gone offline at storage level so that is why even dd was not able to read it. Two that all these disks were thin provisioned (over provisioning of the storage space to improve the utilization; similar to over provisioning of CPU cores in the Virtualization world) from the storage. This particular disk 27 was meant for some other purpose but got wrongly allocated to this diskgroup. The actual space available in the pool (of this disk) was less than what was needed. The moment disks were added to the diskgroup, the rebalance kicked in and ASM started writing data to the disk. Within few minutes space became full and the storage software took the disk offline. Since ASM couldn’t write to the disk, the diskgroup was dismounted.

Fortunately, in the same pool, there was another disk that was still unused. So the storage guy dropped that disk and it freed up some space in the pool. He brought this disk 27 online after that. Diskgroup got mounted and the rebalance kicked in again. Finally, we dropped this disk and the rebalance started again. Once the rebalance completed, disk was free to be taken offline.


Categories: BI & Warehousing

KSQL: Streaming SQL for Apache Kafka

Rittman Mead Consulting - Wed, 2017-10-18 10:18
 Streaming SQL for Apache Kafka

Few weeks back, while I was enjoying my holidays in the south of Italy, I started receiving notifications about an imminent announcement by Confluent. Reading the highlights almost (...I said almost) made me willing to go immediately back to work and check all the details about it.
The announcement regarded KSQL: a streaming SQL engine for Apache Kafka!

My office today... not bad! #sea

— Francesco Tisiot (@FTisiot) August 7, 2017

Before going in detail, lets try to clarify the basics: what is KSQL? Why was it introduced and how does it complement Kafka?

What is KSQL?

We have been writing about Kafka several times, including my recent blogs were I was using it as data hub to capture Game of Thrones tweets and store them in BigQuery in order to do sentiment analysis with Tableau. In all our examples Kafka has been used just for data transportation with any necessary transformation happening in the target datastore like BigQuery, with the usage of languages like Python and engines like Spark Streaming or directly in the querying tool like Presto.

KSQL enables something really effective: reading, writing and transforming data in real-time and a scale using a semantic already known by the majority of the community working in the data space, the SQL!

 Streaming SQL for Apache Kafka

KSQL is now available as developer preview, but the basic operations like joins, aggregations and event-time windowing are already covered.

What Problem is KSQL Solving?

As anticipated before, KSQL solve the main problem of providing a SQL interface over Kafka, without the need of using external languages like Python or Java.
However one could argue that the same problem was solved before by the ETL operations made on the target datastores like Oracle Database or BigQuery. What is the difference then in KSQL approach? What are the benefits?

The main difference in my opinion is the concept of continuous queries: with KSQL transformations are done continuously as new data arrives in the Kafka topic. On the other side transformations done in a database (or big data platforms like BigQuery) are one off and if new data arrives the same transformation has to be executed again.

 Streaming SQL for Apache Kafka

So what is KSQL good for? Confluent's KSQL introduction blog post provides some use cases like real time analytics, security and anomaly detection, online data integration or general application development. From a generic point of view KSQL is what you should use when transformations, integrations and analytics need to happen on the fly during the data stream. KSQL provides a way of keeping Kafka as unique datahub: no need of taking out data, transforming and re-inserting in Kafka. Every transformation can be done Kafka using SQL!

As mentioned before KSQL is now available on developer preview and the feature/function list is somehow limited compared to more mature SQL products. However in cases where very complex transformations need to happen those can still be solved either via another language like Java or a dedicated ETL (or view) once the data is landed in the destination datastore.

How does KSQL work?

So how does KSQL work under the hood? There are two concepts to keep in mind: streams and tables. A Stream is a sequence of structured data, once an event was introduced into a stream it is immutable, meaning that it can't be updated or deleted. Imagine the number of items pushed or pulled from a storage: "e.g. 200 pieces of ProductA were stocked today, while 100 pieces of ProductB were taken out".
A Table on the other hand represents the current situation based on the events coming from a stream. E.g. what's the overall quantity of stocks for ProductA? Facts in a table are mutable, the quantity of ProductA can be updated or deleted if ProductA is not anymore in stock.

 Streaming SQL for Apache Kafka

KSQL enables the definition of streams and tables via a simple SQL dialect. Various streams and tables coming from different sources can be joined directly in KSQL enabling data combination and transformation on the fly.

Each stream or table created in KSQL will be stored in a separate topic, allowing the usage of the usual connectors or scripts to extract the informations from it.

KSQL in Action Starting KSQL

KSQL can work both in standalone and client-server mode with the first one aimed at development and testing scenarios while the second supporting production environments.
With the standalone mode KSQL client and server are hosted on the same machine, in the same JVM. On the other side, in client-server mode, a pool of KSQL server are running on remote machine and the client connects to them over HTTP.

For my test purposes I decided to use the standalone mode, the procedure is well explained in confluent documentation and consist in three steps:

  • Clone the KSQL repository
  • Compile the code
  • Start KSQL using local parameter
./bin/ksql-cli local
Analysing OOW Tweets

I'll use for my example the same Twitter producer created for my Wimbledon post. If you notice I'm not using the Kafka Connect, this is due to KSQL not supporting AVRO formats as of now (remember is still in dev phase?). I had then to rely on the old producer which stored the tweet in JSON format.

For my tests I've been filtering the tweets containing OOW17 and OOW (Oracle Open World 2017), and as mentioned before, those are coming in JSON format and stored in a Kafka topic named rm.oow. The first step is then to create a Stream on top of the topic in order to structure the data before doing any transformation.
The guidelines for the stream definition can be found here, the following is a cutdown version of the code used

CREATE STREAM twitter_raw ( \  
  Created_At VARCHAR, \
  Id BIGINT, \
  Text VARCHAR, \
  Source VARCHAR, \
  Truncated VARCHAR, \
  User VARCHAR, \
  Retweet VARCHAR, \
  Contributors VARCHAR, \
  ...) \
WITH ( \  
  kafka_topic='rm.oow', \
  value_format='JSON' \

Few things to notice:

  • Created_At VARCHAR: Created_At is a timestamp, however in the first stream definition I can't apply any date/timestamp conversion. I keep it as VARCHAR which is one of the allowed types (others are BOOLEAN, INTEGER, BIGINT, DOUBLE, VARCHAR, ARRAY<ArrayType> and MAP<VARCHAR, ValueType>).
  • User VARCHAR: the User field is a JSON nested structure, for the basic stream definition we'll leave it as VARCHAR with further transformations happening later on.
  • kafka_topic='rm.oow': source declaration
  • value_format='JSON': data format

Once created the first stream we can then query it in SQL like

select Created_at, text from twitter_raw  

with the output being in the form of a continuous flow: as soon as a new tweet arrives its visualized in the console.

 Streaming SQL for Apache Kafka

The first part I want to fix now is the Created_At field, which was declared as VARCHAR but needs to be mutated into timestamp. I can do it using the function STRINGTOTIMESTAMP with the mask being EEE MMM dd HH:mm:ss ZZZZZ yyyy. This function converts the string to a BIGINT which is the datatype used by Kafka to store timestamps.

Another section of the tweet that needs further parsing is the User, that as per the previous definition returns the whole nested JSON object.

"name":"Francesco Tisiot",
"location":"Verona, Italy","url":"",

Fortunately KSQL provides the EXTRACTJSONFIELD function that we can then use to parse the JSON and retrieve the required fields

I can now define a new twitter_fixed stream with the following code

create stream twitter_fixed as  
  select STRINGTOTIMESTAMP(Created_At, 'EEE MMM dd HH:mm:ss ZZZZZ yyyy') AS  Created_At, \
    Id, \
    Text, \
    Source, \
    ..., \
    EXTRACTJSONFIELD(User, '$.name') as User_name, \
    EXTRACTJSONFIELD(User, '$.screen_name') as User_screen_name, \
    EXTRACTJSONFIELD(User, '$.id') as User_id, \
    EXTRACTJSONFIELD(User, '$.location') as User_location, \
    EXTRACTJSONFIELD(User, '$.description') as description \
  from twitter_raw

An important thing to notice is that the Created_At is not encoded as BigInt, thus if I execute select Created_At from twitter_fixed I get only the raw number. To translate it to a readable date I can use the STRINGTOTIMESTAMP function passing the column and the data format.

The last part of the stream definition I wanted to fix is the settings of KEY and TIMESTAMP: a KEY is the unique identifier of a message and, if not declared, is auto-generated by Kafka. However the tweet JSON contains the Id which is Twitter's unique identifier, so we should to use it. TIMESTAMP associates the message timestamp with a column in the stream: Created_At should be used. I can defined the two above in the WITH clause of the stream declaration.

create stream twitter_with_key_and_timestamp \  
as \  
select * from twitter_fixed \  
with \  
(KEY='Id', TIMESTAMP='Created_At');

When doing a select * from twitter_with_key_and_timestamp we can clearly see that KSQL adds two columns before the others containing TIMESTAMP and KEY and the two are equal to Created_At and Id.

 Streaming SQL for Apache Kafka

Now I have all the fields correctly parsed as KSQL stream, nice but in my previous blog post I had almost the same for free using Kafka Connect. Now It's time to discover the next step of KSQL: tables!

Let's first create a simple table containing the number of tweets by User_name.

create table tweets_by_users as \  
select user_screen_name, count(Id) nr_of_tweets \  
from twitter_with_key_and_timestamp \  
group by user_screen_name  

When then executing a simple select * from table we can see the expected result.

 Streaming SQL for Apache Kafka

Two things to notice:

  • We see a new row in the console every time there is a new record inserted in the oow topic, the new row contains the updated count of tweets for the screen_name selected
  • The KEY is automatically generated by KSQL and contains the screen_name

I can retrieve the list of tables define with the show tables command.

 Streaming SQL for Apache Kafka

It's interesting to notice that the format is automatically set as JSON. The format property, configured via the VALUE_FORMAT parameter, defines how the message is stored in the topic and can either be JSON or DELIMITED.


When grouping, KSQL provides three different windowing functions:

  • Tumbling: Fixed size, non overlapping. The SIZE of the window needs to be specified.
  • Hopping: Fixed size, possibly overlapping. The SIZE and ADVANCE parameters need to be specified.
  • Session: Fixed size, starting from the first entry for a particular Key, it remains active until a new message with the same key happens within the INACTIVITY_GAP which is the parameter to be specified.

 Streaming SQL for Apache Kafka

I can create simple table definition like the number of tweets by location for each tumbling session with

create table rm.tweets_by_location \  
as \  
select user_location, \  
count(Id) nr_of_tweets \  
from twitter_with_key_and_timestamp \  
group by user_location  

the output looks like

 Streaming SQL for Apache Kafka

As you can see the KEY of the table contains both the user_location and the window Timestamp (e.g Colombes : Window{start=1507016760000 end=-})

An example of hopping can be created with a similar query

create table rm.tweets_by_location_hopping \  
as \  
select user_location, \  
count(Id) nr_of_tweets \  
from twitter_with_key_and_timestamp \  
group by user_location;  

With the output being like

 Streaming SQL for Apache Kafka

It's interesting to notice that each entry (e.g. Europe North, Switzerland) is listed at least three times. This is due to the fact that in any point in time there are three overlapping windows (SIZE is 30 seconds and ADVANCE is 10 seconds). The same example can be turn into the session windows by just defining WINDOW SESSION (30 SECONDS).

The windowing is an useful option, especially when combined with HAVING clauses since it gives the option to define metrics for real time analysis.
E.g. I may be interested only items that have been ordered more than 100 times in the last hour, or, in my twitter example in user_locations having a nr_of_tweets greater than 5 in the last 30 minutes.


So far so good, a nice set of SQL functions on top of data coming from a source (in my case twitter). In the real word however we'll need to mix information coming from disparate sources.... what if I tell you that you can achieve that in a single KSQL statement?

 Streaming SQL for Apache Kafka

To show an integration example I created a simple topic known_twitters using the kafka-console-producer.

./bin/kafka-console-producer --topic known_twitters --broker-list myserver:9092

Once started I can type in messages and those will be stored in the known_twitters topic. For this example I'll insert the twitter handle and real name of known people that are talking about OOW. The format will be:



FTisiot,Francesco Tisiot  
Nephentur,Christian Berg  

Once inserted the rows with the producer I'm then able to create a KSQL stream on top of it with the following syntax (note the VALUE_FORMAT='DELIMITED')

create stream people_known_stream (\  
screen_name VARCHAR, \  
real_name VARCHAR) \  
WITH (\  
KAFKA_TOPIC='known_twitters', \  

I can now join this stream with the others streams or tables built previously. However when trying the following statement

select user_screen_name from rm.tweets_by_users a join PEOPLE_KNOWN_STREAM b on a.user_screen_name=b.screen_name;  

I get a nice error

Unsupported join logical node: Left: io.confluent.ksql.planner.plan.StructuredDataSourceNode@6ceba9de , Right: io.confluent.ksql.planner.plan.StructuredDataSourceNode@69518572  

This is due to the fact that as of now KSQL supports only joins between a stream and a table, and the stream needs to be specified first in the KSQL query. If I then just swap the two sources in the select statement above:

select user_screen_name from PEOPLE_KNOWN_STREAM a join rm.tweets_by_users b on a.screen_name=b.user_screen_name;  

...I get another error

Join type is not supportd yet: INNER  

We have to remember that KSQL is still in developer beta phase, a lot of new features will be included before the official release.

adding a LEFT JOIN clause (see bug related) solves the issue and I should be able to see the combined data. However when running

select * from PEOPLE_KNOWN_STREAM left join TWEETS_BY_USERS on screen_name=user_screen_name;  

Didn't retrieve any rows. After adding a proper KEY to the stream definition

as select screen_name , \  
real_name from  people_known_stream \  
PARTITION BY screen_name;  

I was able to retrieve the correct rowset! Again, we are in early stages of KSQL, those fixes will be enhanced or better documented in future releases!


As we saw in this small example, all transformations, summaries and data enrichments were done directly in Kafka with a dialect very easy to learn for anyone already familiar with SQL. All the created streams/tables are stored as Kafka topics thus the standard connectors can be used for sink integration.

As mentioned above KSQL is still in developer preview but the overall idea is very simple and at the same time powerful. If you want to learn more check out the Confluent page and the KSQL github repository!

Categories: BI & Warehousing

Incremental ETL : Streaming via Micro-Batch

Dylan's BI Notes - Wed, 2017-10-11 23:40
A modern analytic application takes the approach of streaming data to perform the similar process as the traditional data warehousing incremental ETL. Actually, if we look into Spark Streaming in details, the concept of streaming in Spark and Incremental ETL are the same: Spark Streaming is a Micro-Batch based streaming. Each micro-patch is much like […]
Categories: BI & Warehousing

Incremental ETL – The last refresh date

Dylan's BI Notes - Wed, 2017-10-11 15:31
There are multiple ways to model the last refresh date. In OBIA, DAC and Informatica based ETL, the last refresh date is maintained within DAC. It is maintained at the level of the source tables that populates the data. Oracle BI DAC User Guide > About Refresh Dates and DAC’s Incremental Load Strategy In OBIA […]
Categories: BI & Warehousing

Use Bit to represent groups

Dylan's BI Notes - Wed, 2017-10-11 03:17
Here I am providing an alternate approach of supporting group membership in MySQL. It is a common seen requirement that a group may have multiple members and a person may be added to multiple groups.  This many to many relationship is typically modeled in an intersection table. When the group membership is being used as […]
Categories: BI & Warehousing

ODC Appreciation Day: OBIEE's Time Hierarchies

Rittman Mead Consulting - Tue, 2017-10-10 01:58
 OBIEE's Time Hierarchies

After last year successful OTN Appreciation Day, it's time again to show our love for a particular feature in any Oracle's tool we use in our work. You may have noted a name change with OTN now becoming ODC: Oracle Developer Community.


The feature I want to speak about is OBIEE's Time Hierarchies.
For anybody in the BI business the time dimension(s) are the essence of the intelligence bit: being able to analyze trends, compare current period with previous one, plot year to date or rolling measures are just some of the requirements we get on daily basis.
A time hierarchy definition allows the administrator to set which time levels are exposed, how the rollup/drill down works and how previous/following members of the level are calculated.
Once the hierarchy is defined, all the related calculations are simple as calling a function (e.g. AGO), defining the level of detail necessary (e.g. Month) and the number of items to take into account (e.g. -1).

A Time hierarchy definition is necessary in the following cases:

  • Time comparisons - e.g. current vs previous month
  • Time related rollups - e.g. Year to date
  • Drill path definition - e.g. Year-Month-Day
  • Fact Tables at various level of details - e.g. daily fact table and monthly pre-aggregated rollup
  • Time related level based measures - e.g. monthly sum of sales coming from a fact table at daily level

Why do I like time hierarchies? Simple! It's a very clever concept in the RPD, which requires particular knowledge and dedicated attention.

If done wright, once defined, is available in every related table and makes the time comparison formulas easy to understand and to create. If done wrong, errors or slowness in the related analysis can be difficult to spot and improve/fix.

Still time hierarchies are a central piece in every BI implementation, so once understood and implemented correctly give a massive benefit to all developers.


We blogged about time dimensions and calculations back in 2007 when OBI was still on version 10! The original functionality is still there and the process to follow is pretty much the same.
Recently was introduced the concept of Logical Sequence Number, a way of speeding up some time series calculations by removing the ranking operations needed to move back (or forth) in history.

 OBIEE's Time Hierarchies

I wanted to keep the blog post short, since the time hierarchies information can be found in millions of blog posts. I just wanted the to give few hints to follow when creating a time hierarchy:

  • It can be created on any data with a predefined order, no need to be a date! you could compare for example a certain product with another in the inventory having the previous code.
  • The Chronological Key defines the sorting of the level, for example how years, months or dates are ordered. Ordering months alphabetically with a format like YYYY-MM it's correct while using MM-YYYY provides wrong results.
  • Double check the hierarchies, something like YEAR-> MONTH -> WEEK -> DATE can be incorrect since a week can be split in different months!
  • Set appropriately the number of elements for each level. This is useful, especially when the hierarchy is complex or pre-aggregated facts, for OBIEE to understand which table to query depending on the level of the analysis.
  • Setup the Logical Sequence Number. LSNs are useful if you are looking to reduce the impact of the time series processing at a minimum.
  • If you are looking for very optimal performances for a specific report, e.g. current year vs previous, physicalizing the time series result, previous year, directly in the table alongside with the current year will give what you're looking for.

This was just a quick overview of OBIEE's Time Hierarchies, why are so useful and what you should be looking after when creating them! Hope you found this short post useful.

Follow the #ThanksODC hashtag on Twitter to check which post have been published on the same theme!

Categories: BI & Warehousing

Schema On Read?

Dylan's BI Notes - Sun, 2017-09-24 07:48
I saw “create external table ” first in Oracle DBMS 11G. It was created for the purpose of loading data. When Hive was introduced, a lot of data were already created in HDFS. Hive was introduced to provided the SQL interface on these data. Using the external table concept is a nature of the design.  […]
Categories: BI & Warehousing

Preserve Surrogate Key During Upgrade

Dylan's BI Notes - Sat, 2017-09-23 07:00
The generated surrogate key is used everywhere in the data warehouse.  What do we do during upgrade? Here are some approaches: 1. Full Refresh You can perform a full refresh of the data warehouse.  The surrogate keys will be regenerated.  The FK will be updated. Obviously, this is not a good approach.  There are problems […]
Categories: BI & Warehousing

Unify Update - v1.0.1

Rittman Mead Consulting - Fri, 2017-09-22 07:54
Unify Update - v1.0.1

Unify Update - v1.0.1

We have updated Unify following feedback from our customers and have released version 1.0.1. The following bugs have been fixed and features added:

  • Change the default port to 3724 as 8080 is the default port of Oracle XE.
  • Allow port configuration in the desktop app.
  • Fixed problem with date filters not working with >, >=, <, <= operators.
  • Fixed some problems using presentation variables used in filters.
  • Made the preview table scale to the resolution of the screen instead of being fixed size.
  • Enabled parsing for dashboard pages, so an OBIEE page can be opened and each report from it will be loaded into Unify.
  • Made viewing column or filter panes optional in the UI.
  • Improved tray icons for Mac distribution of the Desktop app.
  • Distinguish measures and attributes with icons in the presentation layer.
  • Allow queries from multiple subject areas.
  • Switched to Tableau WDC 2.0.9 to facilitate compatbility with Tableau 10.0.

You download Unify from our website:

Categories: BI & Warehousing

Unified Data Model or Not

Dylan's BI Notes - Wed, 2017-09-13 17:07
Do we need to store the data all together in same places? Do we need to use the same data model ? Do we need to put data into cloud? Storing the data into a central place is not necessary, as nowadays, I do not really know where the data are stored.  If we talk […]
Categories: BI & Warehousing

How to – Incremental ETL

Dylan's BI Notes - Wed, 2017-09-06 13:11
This is a very basic topic.  An ETL 101 question come up a lot in interview. Even we are moving to a different storage and different processing framework, the concepts are still important. The idea is simple – you do not need to keep extracting and updating all data in the data store that are […]
Categories: BI & Warehousing

Game of Thrones S07 Last Episode: The Summary

Rittman Mead Consulting - Fri, 2017-09-01 09:36
 The Summary

Watch the Episode First! It's a friendly suggestion...

The final #GoT episode was transmitted last Sunday, now two years waiting for the next season... How can HBO be so cruel??? And how can I find interesting content for my future blog posts???
At least now European football (not soccer) leagues are back, so TV-side I'm covered!


Going back to serious discussions, Game of Thrones last episode: Yay or Nay? The average sentiment for the episode (taking into account only tweets since Monday) was -0.012: it is negative but represents an improvement when compared to the two previous ones (with episode 6 having the most negative sentiment score).

 The Summary

But... Hey! What is the line on top going in time? The line it's due to the external R call and the fact that is forcing us to include the Tweet Text column in the analysis in order to be evaluated. The evaluation of the sentiment is applied on ATTR(Tweet Text) which means kind of SELECT DISTINCT Tweet_Text in Oracle terms. The line on top is drawn because the same Tweet Text was tweeted across several weeks.

 The Summary

Please notice that the three overall sentiments are close (between 0.01 and 0.10) so, when looking in detail at the distribution of sentiment scores across the episodes we can see that, as expected, are similar.

 The Summary

Zooming to single characters we can see the scatterplot of the last episode, with Jon Snow (or should I say Targaryen?) leading the number of mentions with surprisingly Littlefinger on the second spot and Arya on the third: probably the Baelish dying scene at Winterfell was something highly appreciated by the fans.

 The Summary

On the positive negative feeling almost nothing changed with Arya and the Night King being the negative and positive poles. I've been telling you about change of leadership on the various axes of the scatterplot by visually comparing today's scatterplot with the previous two. However the transition of the character position in the graph can be visualized again on multiple scatterplots.

 The Summary

By creating a scatterplot for each character and assigning to the episodes a different number (E05-1, E06-2, E07-3) I can clearly see how Davos Seaworth for example had a big sentiment variation going very positive in the last episode while Jaime Lanninster was more stable. Zooming into Davos position we can see how the sentiment distribution changed across episodes with the E06 representing the most negative while the E07 has almost all positive tweets.

 The Summary

Looking at the words composing Davos tweets we can immediately spot few thigs:

 The Summary

  • SIR has a positive sentiment (Sir Davos is how several characters call him) which is driving the overall score in the final episode
  • The number of tweets mentioning Davos was very small in E06 compared to the other two (we can see the same from the related scatterplot above)
  • In E07 we see a good number of circles having the same (big) size, possibly is the same text which has been tweeted several times.

To verify the last point we can simply show the Tweet Text along the # of Tweets and discover that almost the same positive Text count for over the 99% of the whole reference to the character.

 The Summary


One of the cool functions of the Syuzhet package is named get_nrc_sentiment and allows the extrapolation of emotions from a text based on the NRC emotion lexicon. The function takes a text as input and returns a data frame containing a row for each sentence and a column for emotion or sentiment.
The sentiment can either be positive or negative which we already discussed a lot previously. The emotion is split in eight categories: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust.

We can extract the eight different emotions into eight calculations with the following code


To calculate the Anger Emotion Score we are passing ATTR(Text), the list of Tweet's texts, and taking the output of the anger column of the dataframe. We can do the same for all the other emotions and create separate graphs to show their average across characters for the last episode. In this case I took Disgust, Anger, Fear, Joy and Trust.

 The Summary

We can then clearly see that Bran Stark is the character that has most Disgust associated to. Bron has a special mix of emotions, he's in the top for Anger, Fear and Joy, such a mix can justify the average sentiment which is close to neutral (see scatterplot above). On the Trust side we can clearly see that the North wins with Arya and Sansa on the top, interesting here is to see also Lord Varys.
Looking into Bran Disgust detail we can see that is driven by the categorization of the BRAN word as disgusting, probably the dictionary doesn't like cereals.

 The Summary

Scene Emotions

In my previous post I've been talking about the "Game of Couples" and how a single character sentiment score could be impacted by a reference to a second character. For the last episode of the series I wanted to look at different scenes: the main characters I want to analyse are Jon Snow, Littlefinger and Sansa. Specifically I want to understand how people on Twitter reacted to the scenes where the two characters had a big impact: the death of Littlefinger declared by Sansa and the revelation of Jon Targaryen.

The first thing I wanted to check is the Surprise: How are characters categorized by this emotion? We can see Bron on top being driven by the word GOOD in the related tweets.

 The Summary

We can also notice that Petyr score is quite high (0.2590 and 2nd position) while Jon score is pretty low, probably averaged by the huge number of tweets. We can also see that Sansa score is not very high, even if she is the character providing quite a big shock when accusing Littlefinger.

The overall character average surprise doesn't seem to be very relevant, we need to find a way to filter tweets related to those particular scenes: we can do that by including only few keywords in the Tweet Text. Please note we are going to filter words that will create an OR condition. If a tweet contain ANY of the words mentioned, it will be included.

First I wanted to check which are the words in Jon's tweets driving the Surprise sentiment alongside the # of Tweets

 The Summary

However this is only giving us details on which words are classified as Surprise for Jon, nothing really related to the scenes. I can however filter only the tweets with an overall Surprise sentiment for Jon and check which words are mostly associated with them. I also added a filter for Tweets containing the words TARGARYEN OR SON since I assumed those two could be more frequently used describing the scene.

 The Summary

We can clearly see some patterns that are well recognized correctly by the Surprise metric: both Aegon (a reference to Jon's real name) and Aunt (reference to Lyanna or Deanerys?) are in the top 20 and a little bit further right in the graph we can also spot Father. There probably is also some surprise in tweets related to what's going to happen when Jon finds out he's a Targaryen since all keywords are present in the top 20.

When doing a similar analysis on Sansa I wanted to add another metric to the picture: the Average Sentence Emotion Score for all sentences including a word. With this metric we can see how a word (for example AMAZING) changes the average emotion of the sentences where is included. Analysing this metric alone however wouldn't be useful: obviously the words having more impact on emotion are the ones categorized as such in the related dictionary.

I found interesting the following view for Sansa: we see across all the tweets categorized as Surprizing, which are the words most mentioned (Y-axes) and what's the average Surprise emotion value for the sentences were those words were included.

 The Summary

We can spot that MURDER and TREASON were included with a big number of tweets (>500) having an average Surprise score around 2. This seems to indicate that the scene of Sansa convicting Lord Baelish wasn't expected from the fans.

One last graph shows how the character couples (remember the game of couples in my previous post?) have been perceived: the square color defines the average Surprise score while the position in the X-axis confidence (by the # of Tweets).

 The Summary

We can spot that the couple Cercei and Sansa is the one having most Surprise emotion, followed by Cercei and Daenerys. Those two couples may be expected since the single characters had major parts in the last episode. Something unexpected is the couple Sandor Clegane and Brienne, looking in detail, the surprise is driven by a mention to the word MURDER which is included in 57.76% of the Tweets mentioning both.

 The Summary

A last technical note: during the last few weeks I've collected about 700 thousands tweets, the time to analyse them highly depends on the complexity of the query. For simple counts or sums based only on BigQuery data I could obtain replies in few seconds. For other analysis, especially when sentiment or emotion was included, a big portion of the raw dataset was retrieved from BigQuery into Tableau, passed to R with the function results moved back to Tableau to be displayed. Those queries could take minutes to be evaluated.
As written in my previous blog post, the whole process could be speed up only by pre-processing the data and storing the sentiment/emotion in BigQuery alongside with the data.

 The Summary

my series of blog post about Game of Thrones tweet and press analysis with Kafka, BigQuery and Tableau! See you in two years for the analysis of the next season with probably a whole new set of technology!


Categories: BI & Warehousing

Use Surrogate Key in Data Warehouse

Dylan's BI Notes - Thu, 2017-08-31 07:46
Using surrogate key is part of dimensional modeling technique for populating a data warehouse using a relational database. The original idea was to generate the sequence generated IDs and use them in between the fact and dimension table, so we can avoid using the concatenated string or using composite key to join.  Also, due to […]
Categories: BI & Warehousing


Subscribe to Oracle FAQ aggregator - BI &amp; Warehousing