- Uwe's Blog

Writing a boolean array for pandas that can deal with missing values

· 02 Sep 2019

When working with missing data in pandas, one often runs into issues because the main workaround is to convert the data into float columns. pandas provides efficient, native support for boolean columns through numpy.dtype('bool'). Sadly, this dtype only supports True/False as possible values and offers no possibility for storing missing...
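As a rough illustration of the limitation described in this teaser, the sketch below shows how a missing value pushes data out of the compact dtypes and into float64. It is not taken from the post itself; the variable names (`flags`, `mixed`, `counts`, `counts_with_gap`) are made up for this example.

```python
import numpy as np
import pandas as pd

# A boolean column without gaps uses the compact NumPy bool dtype.
flags = pd.Series([True, False, True])

# NumPy's bool dtype has no slot for "missing": mixing in NaN
# promotes the whole array to float64.
mixed = np.array([True, False, np.nan])

# The same float fallback shows up in pandas when a gap is introduced
# into an integer column, here by reindexing onto a label (3) that
# does not exist in the original index.
counts = pd.Series([1, 2, 3])
counts_with_gap = counts.reindex([0, 1, 2, 3])

print(flags.dtype)
print(mixed.dtype)
print(counts_with_gap.dtype)
```

The three dtypes printed are bool, float64 and float64: the only way to keep the missing entry is to give up the original dtype.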

Why the NYC TLC trip record data is a nice training dataset for Data Engineers

· 22 Aug 2019

The New York City Taxi & Limousine Commission Trip Record Data is a really nice dataset for getting started with Data Engineering or for teaching it. It has several nice properties that make it quite useful, which we will show in this article. We will look at this data using only pandas, without introducing any other tooling. Many...

Data Engineers: The best friends of Data Scientists you forgot to hire.

· 13 Feb 2019

At the moment, there are two hot topics in Computer Science: AI and Blockchain. Behind these two buzzwords, there are industries striving to build successful products. Currently, I work in the sector often labelled as AI. It is also commonly described with other terms like Machine Learning or Big Data. In this sector, the most sought-after job at the moment is the...

Data Science I/O - A baseline benchmark for 2019

· 27 Jan 2019

Data Science and Machine Learning are tasks that have their own requirements on I/O. Like many other tasks, they start out on tabular data in most cases. In contrast to a typical reporting task, they don’t work on aggregates but require the data at the most granular level. Some machine learning algorithms are able to work directly on aggregates but...

PyFlame: profiling running Python processes

· 05 Oct 2018

Identifying performance bottlenecks in long-running processes often involves careful instrumentation ahead of time or guessing where the root of the problem may be. Especially welcome are tools that help you diagnose problems in live systems without modifying them. One important tool I recently came across is the pyflame profiler.

Use Numba to work with Apache Arrow in pure Python

· 03 Aug 2018

Apache Arrow is an in-memory format for columnar data. In more “plain” English, it is a standard for how to store DataFrames/tables in memory, independent of the programming language. One of its most prominent uses is in the @pandas_udf decorator in Apache Spark, where it moves data quickly between Scala and Python/pandas.

AHL Python Hackathon April 2018

· 19 May 2018

Three weeks ago, MAN AHL organised an open-source hackathon at their London office. As part of the hackathon, people were meant to contribute to one of the PyData artifacts they regularly use. To support them in making their first contribution, AHL also arranged for several core committers of open-source projects to be present at the event. I joined in as the representative...

Play interactively with Apache Arrow C++ in xeus-cling

· 17 Dec 2017

At work, we often use pyarrow in a Jupyter Notebook. With the xeus-cling kernel, we can also use the C++ APIs directly and interactively in Jupyter.

Akka Streams for extracting Wikipedia Articles

· 24 Feb 2016

Use Akka Streams as a new technique to extract specific articles from the Wikipedia XML dump into single files without the need to fit all data into RAM.

Beats Music Support in Tomahawk (and the long journey on how we got there)

· 18 Jul 2014

tl;dr: With the latest nightlies (Win, Mac) you can now use your Beats Music Subscription in Tomahawk. To use it, just install the Beats Music Resolver. Although Beats has a nice API, supporting it was a tough cruise through our underlying multimedia stack.

How to get global media keys support for Tomahawk in XFCE4

· 03 Jul 2014

Although there seems to be no native support for controlling a media player via the MPRIS specification in XFCE, you can still set up global shortcuts to use the media keys on your keyboard to control Tomahawk regardless of which application currently has focus.

Replace QJson with Qt's own JSON handling in Qt5

· 29 May 2014

tl;dr: A simple wrapper to use QJson for Qt4 and the built-in JSON parser for Qt5 so that QJson is not required if built with Qt5: qjson-qt5json-wrapper (MIT-licensed, no #ifdef in your code).

Using Tomahawk resolvers in node.js

· 28 Jan 2014

tl;dr: I wrote a node.js module so that you can use packaged Tomahawk AXE archives in your node.js application to query music services through a unified interface; see node-tomahawkjs.

Use Media Keys to control Tomahawk in Awesome WM

· 29 Jan 2013

Nowadays the MPRIS specification exists for controlling a media player; sadly, this interface seems to be unsupported by awesome. One solution would be to add some lines to the configuration of xbindkeys and to start it in the background. But as awesome can already handle global keybindings, adding these lines to your .config/awesome/rc.lua will transmit the actions of...

tomahawk-0.6.0beta1 added to Gentoo Overlay

· 10 Jan 2013

The first beta of Tomahawk 0.6.0 was released yesterday. Tomahawk is a music player which decouples the name of the song from the source it is streamed from. Using the Playdar API, it resolves the location to stream from using all of your available sources.