Tuesday, December 29, 2015

Why Blockchain users should fear Machine Learning

The anonymity provided by Bitcoin and other blockchain-based technologies has been pitched as a significant and differentiating feature relative to institution-mediated transactions. While some researchers have demonstrated the ability to link an anonymous Bitcoin wallet to a real-world entity, there are several simple countermeasures to that kind of linking (e.g., tumblers). A new paper published by researchers at Princeton's Center for Information Technology Policy (CITP) highlights a new threat to anonymity in blockchain products like Ethereum.

How CITP researchers identify code authors from their compiled binaries

Plagiarism detection software, often based on term frequency-inverse document frequency (TF-IDF) scoring, has been used in several contexts for years. The CITP group has extended these capabilities to detect the same authorial fingerprint in the abstract syntax trees (ASTs) of code written in lower-level programming languages like C/C++. Their most recent paper demonstrates that the group can identify the code author by examining only the compiled code.
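To make the TF-IDF idea concrete, here is a minimal sketch of how a similarity score between code samples might be computed. It is not the CITP method (which works on syntax trees and compiled binaries); it only illustrates the plagiarism-detection baseline on raw source tokens. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `attribute`), the author labels, and the sample snippets are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(source):
    # Crude lexer (an assumption): identifiers, numbers, single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (token -> weight) per document.
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    # Smoothed IDF so tokens present in every document keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for counts in counts_per_doc:
        total = sum(counts.values())
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(known, unknown):
    # Return the label whose sample is most similar to the unknown snippet,
    # along with the similarity score for every label.
    labels = list(known)
    vectors = tfidf_vectors([known[label] for label in labels] + [unknown])
    target = vectors[-1]
    scores = {label: cosine(vec, target) for label, vec in zip(labels, vectors)}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical samples: two authors with different loop habits.
    known = {
        "alice": "for (int i = 0; i < n; ++i) { total += values[i]; }",
        "bob": "int idx = 0; while (idx < count) { sum = sum + arr[idx]; idx++; }",
    }
    unknown = "for (int i = 0; i < len; ++i) { total += data[i]; }"
    best, scores = attribute(known, unknown)
    print(best, scores)
```

A real attribution system would need far more training code per author and richer features than surface tokens, which is exactly where the AST and binary-level features come in.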

The authors point out that this could go a long way towards identifying the authors of malware. But it also has enormous implications in the world of blockchain. The appeal of technologies like Ethereum is the ability to insert executable code into the blockchain. Like C code, Ethereum contracts are compiled to binary code, which one would expect to obscure any coding style information. The expectation is that one could build an anonymous exchange on a blockchain that would allow transactions of complex code-based financial and contractual instruments. The new CITP research suggests that the identity of the code author may be deduced purely from the compiled binary code embedded in the blockchain.

You may think this is making a mountain out of a molehill--after all, it is likely that the major players in any Ethereum-based system will be public. However, the ability to identify the author doesn't stop at the institution level.  It's reasonable to expect that there would be several code authors within each large institution, each with a particular trading focus. The CITP research suggests that not only could you identify that a particular instrument originated from a particular bank, but you could even identify which department or strategy group authored the code.  And just like that, all of the bank's internal strategy decisions dating back to the beginning of the blockchain will be out in the open.  The simplest tell might be telegraphing the existence of an iceberg order, but there's no doubt someone with such a wealth of inside information--legally gleaned!--could make a fortune at the bank's expense.