The impact of societal biases on machine learning algorithms. Natural Language Processing. Manipulating the vox populi and ethics in machine learning. You could learn a lot about these topics at PyData Berlin 2017. I was there, and here’s what I saw.

Natural Language Processing

One of the main topics that I found very interesting revolved around machine learning and Natural Language Processing. I especially liked the talk by Robert Meyer on using Doc2Vec, a neural network framework for text analysis, to analyze user comments collected from German news websites.

In general, Word2Vec depends on feeding the model a huge corpus of text (Doc2Vec is a related model that additionally attaches a label, such as the source, to each document). The neural network learns to represent each word as a vector by predicting words from their surrounding context. As a result, words that share a common context end up in close proximity to one another in the vector space. A well-trained model asked for ‘king’ - ‘man’ + ‘woman’ should answer ‘queen’.
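The ‘king’ - ‘man’ + ‘woman’ arithmetic can be sketched with a handful of hand-made vectors. This is only an illustration, not a trained model: the three-dimensional `TOY_VECTORS` and the `analogy` helper are made up for this example, while real Word2Vec vectors are learned from a corpus and typically have hundreds of dimensions.

```python
import numpy as np

# Hand-made toy vectors (assumption for illustration only).
# The dimensions loosely stand for: royalty, male, female.
TOY_VECTORS = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man": np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "prince": np.array([0.7, 0.9, 0.1]),
    "princess": np.array([0.6, 0.1, 0.9]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(positive_a, negative, positive_b, vectors=TOY_VECTORS):
    """Return the word closest to positive_a - negative + positive_b."""
    target = vectors[positive_a] - vectors[negative] + vectors[positive_b]
    # The query words themselves are excluded from the candidates.
    excluded = {positive_a, negative, positive_b}
    candidates = [word for word in vectors if word not in excluded]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(analogy("king", "man", "woman"))  # prints "queen" for these toy vectors
```

With a model trained on real text, the nearest word depends entirely on the corpus, which is exactly why the same query can give surprising answers.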

Now, coming back to Robert Meyer’s talk: he trained his neural network on comments from the websites of the Spiegel, Zeit and Focus newspapers. The funny thing was that when he ran the same ‘king’ - ‘man’ + ‘woman’ equation on his trained model, the answer was ‘Angela Merkel’. On a more serious note, Meyer’s hypothesis was that we can identify a stereotypical comment for each news site and that we can probably determine which website a given comment came from. To test it, he calculated an averaged comment for each website. By that measure, the averaged Zeit comment was overintellectualized gibberish, the Focus one was full of hate speech, and the Spiegel one served as a toned-down middle ground.
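One way to picture an "averaged comment" is as the mean of all comment vectors from one site, with a new comment assigned to the site whose mean it resembles most. The sketch below is an assumption about how this could be done, not a reconstruction of Meyer’s actual pipeline; the site names, the two-dimensional vectors and the helper functions are all hypothetical.

```python
import numpy as np


def site_centroids(vectors_by_site):
    """Average the comment vectors of each site into one centroid per site."""
    return {
        site: np.mean(np.asarray(vectors), axis=0)
        for site, vectors in vectors_by_site.items()
    }


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def guess_site(comment_vector, centroids):
    """Return the site whose averaged comment is most similar to the vector."""
    return max(centroids, key=lambda site: cosine(comment_vector, centroids[site]))


# Made-up comment vectors; a real setup would take them from a trained
# Doc2Vec model instead.
comment_vectors = {
    "site_a": [[1.0, 0.0], [0.8, 0.2]],
    "site_b": [[0.0, 1.0], [0.2, 0.8]],
}

centroids = site_centroids(comment_vectors)
print(guess_site(np.array([0.7, 0.3]), centroids))  # prints "site_a"
```

Comparing against a single average per site is about the simplest classifier possible; it only works well if comments from different sites really do cluster apart.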

Machine Learning and Blockchains

I also attended Introduction to Machine Learning with H2O and Python, a talk by Jo-fai Chow. H2O is a technology that makes it easy to use a cluster of servers. You just need to install the H2O service on each machine, and you can access them all from your computer. It’s nice to see that AI and machine learning are becoming much more accessible to people from financial services, healthcare, and academic circles.

Another very interesting talk covered the relationship between blockchains and Artificial Intelligence. Besides the usual talk about cryptocurrencies, we were also shown examples of applications deployed entirely on blockchains: BigchainDB as the database layer, business logic on Ethereum, and parallel matrix multiplication on Golem.

General Impression

The range of topics was rather wide, from beginner-friendly tutorials on Python data science libraries and frameworks, through NLP and Word2Vec models, up to very complicated and niche topics. There were also moments that showed me clearly how much more there is to know. One of these was a talk on the Polynomial Chaos technique for modelling arbitrary statistical distributions which, to be honest, went totally over my head. To get to the point where we could start discussing the technique itself, the speaker had to drag us through multiple slides of heavy math which, as she later told us, amounted to ~3 semesters of graduate-level statistics. Well, maybe next time…

All in all, I’m looking forward to more conferences on Data Science and Machine Learning, and maybe some strictly on mathematics and statistics, because the more you know, the better.