Blockchain from a Data Science Perspective
Imagine someone were to mint a new coin and give it to you. You then buy something and give it to someone else, and so on. Every time this coin changes hands, a record of the transaction is engraved on the coin. Every transaction in the history of that coin’s existence is in plain sight on the face of the coin, and the longer the coin circulates, the harder it is for the earliest transactions to be erased or altered. In the physical world, this is both counterintuitive and impossible; yet something like it is playing out in the digital world.
When you hear about blockchains, the discussion usually centers around some form of cryptocurrency (to date, there are nearly 1400 cryptocurrencies). Blockchain is the distributed ledger infrastructure underneath cryptocurrency transfers. While today’s most common application of blockchain is tracking cryptocurrency transactions, this emergent technology could be harnessed to manage other kinds of data, records, and assets—such as Renewable Energy Certificates, IoT data, intellectual property rights, or election votes, to name a few.
Blockchain enables transactions in a decentralized network, eliminating the middleman from the transaction process. When people transact in a cryptocurrency like bitcoin, transactions are validated through the network rather than through the traditional means of a trusted third party.
Digital currency has been a topic of discussion for decades; in late 2008, a paper published under the pseudonym Satoshi Nakamoto set blockchain technology in motion. Nakamoto’s paper describes a system to counterbalance the “inherent weaknesses of the trust based model” of banking and commercial transactions on the Internet. The “trust based model” refers to the need for a mediator in transactions, particularly in cases involving disputes and reversals of past transactions. Requiring a mediator to resolve issues of trust between two parties adds to the cost of each transaction. Nakamoto observes that “no mechanism exists to make payments over a communications channel without a trusted party.” Nakamoto’s solution: “an electronic payment system based on cryptographic proof instead of trust.”
Transaction validation is key to understanding the inner workings of a blockchain. Validation ensures that the same digital coin can’t be spent twice. One model for this process is Proof of Work (PoW), the model Nakamoto used in implementing the Bitcoin blockchain. In the PoW model, miners run cryptographic algorithms and compete to be the first to bundle incoming transactions into a validated block (think of this for now as a “page” of the distributed ledger). Miners are incentivized by the promise of a block reward (e.g., digital currency) each time they succeed in validating a block of new transactions.
The term “miner” can refer either to the machine running the algorithms or to the person operating a mining farm. In one sense, miners play the role of accountants in the blockchain. The mining process also results in the creation of new currency. (In the case of Bitcoin, new currency is created at a steadily diminishing rate; there will be no new bitcoins once 21 million have been issued.) Finally, and perhaps most importantly, miners underpin the security of the blockchain network. The more computing power independent miners commit, the more work stands behind each approved block, resulting in a stronger chain. As we’ll discuss below, the more blocks are built on top of a transaction, the harder it is to break or change the chain. Moreover, as the number of independent miners in the system goes up, the probability of a 51% attack, in which a single coalition takes control of the system, theoretically goes down.
The difficulty of the calculations required to validate transactions is adjusted according to the rate at which miners succeed at validating new blocks. In 2009, a personal computer had sufficient hardware for mining. As competition among miners increased, the computational difficulty and expense went up as well, and miners moved from personal computers to GPUs to ASICs. With regard to competition, therefore, the more miners, the more difficult and costly it becomes to validate new blocks.
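The adjustment described above can be sketched in a few lines. This is a simplified illustration rather than Bitcoin's actual code: the constants follow Bitcoin's published parameters (a retarget every 2016 blocks, a ten-minute block goal, and a cap of a factor of four per adjustment), while the function name `retarget` and the sample target value are assumptions made for the example.

```python
# Simplified sketch of a Bitcoin-style difficulty retarget.
# The function name and the sample target are hypothetical.
BLOCKS_PER_RETARGET = 2016
TARGET_BLOCK_SECONDS = 10 * 60
EXPECTED_TIMESPAN = BLOCKS_PER_RETARGET * TARGET_BLOCK_SECONDS  # two weeks


def retarget(old_target: int, actual_timespan: int) -> int:
    """Return the new target given how long the last 2016 blocks took."""
    # Limit each adjustment to a factor of four in either direction.
    clamped = max(EXPECTED_TIMESPAN // 4,
                  min(actual_timespan, EXPECTED_TIMESPAN * 4))
    # Blocks found too quickly -> smaller target -> harder puzzle.
    return old_target * clamped // EXPECTED_TIMESPAN


old_target = 1 << 220  # arbitrary illustrative target
faster = retarget(old_target, EXPECTED_TIMESPAN // 2)  # blocks arrived twice as fast
slower = retarget(old_target, EXPECTED_TIMESPAN * 2)   # blocks arrived half as fast
```

When blocks arrive twice as fast as intended, the target halves and mining becomes roughly twice as hard; when they arrive too slowly, the target grows and mining becomes easier.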
How Blockchain Works (The PoW Model)
A blockchain operates within a P2P (peer-to-peer) network, which allows transaction data to be copied and shared across the network. When a new transaction occurs at any node or client, it is picked up and propagated. A P2P architecture enables emergent consensus: agreement that forms gradually as each node holds a complete copy of the records that matches the records of the majority of the other nodes.
Sites with blockchain explorers such as Blockchain.info, BlockCypher, or Bitcoin Block Explorer allow visitors to see a blockchain in action in real time. These sites show streams of new, unconfirmed transactions. Roughly every 10 minutes (although this validation time can vary considerably), a new block in the chain is created when the latest batch of new transactions in the network is successfully validated. Each block, once validated, is copied and distributed throughout the network so that the next batch of incoming transactions can be processed based on data the network has documented as valid.
Transactions are broken down into inputs and outputs on a blockchain ledger. Inputs constitute “value in” (debit), while outputs are “value out” (credit). Inputs and outputs are labeled with the unique wallet addresses, or public keys, of the parties involved in the value transfer. In a Bitcoin transaction, the numbers don’t necessarily add up perfectly: outputs are often slightly less than the inputs. This difference is the implied transaction fee. Validation also happens at the transaction level. A digital signature made possible through public-private key pairs confirms the identity of the owner of the wallet signing off on the transfer of his or her unspent currency.
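The fee arithmetic described above can be illustrated with a small sketch. The `Entry` class, the placeholder address strings, and the amounts are all hypothetical; amounts are expressed in satoshis (the smallest bitcoin unit) so that the arithmetic stays in integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    address: str   # hypothetical placeholder, not a real address format
    satoshis: int  # integer units avoid floating-point rounding


def implied_fee(inputs, outputs):
    """Return the implied fee: total value in minus total value out."""
    total_in = sum(e.satoshis for e in inputs)
    total_out = sum(e.satoshis for e in outputs)
    if total_out > total_in:
        raise ValueError("outputs exceed inputs")
    return total_in - total_out


inputs = [Entry("alice-addr", 50_000)]
outputs = [Entry("bob-addr", 30_000), Entry("alice-change-addr", 19_000)]
fee = implied_fee(inputs, outputs)  # the 1,000 satoshis not claimed by any output
```

Note that the sender's change comes back as an ordinary output; whatever is not claimed by any output is left to the miner who includes the transaction in a block.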
The creation of every new block of transactions signals the end of a global mining competition. When miners see a new block has been validated in the system, they know that they have lost their chance at winning that round. However, this also means that a new race to calculate a cryptographic hash for the next block of transactions has begun. Miners will then take the hash from the last validated block and include that data in the header data structure of the next block in progress.
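A minimal sketch of this linking step follows. It is a deliberate simplification: Bitcoin's real header is a fixed 80-byte binary structure hashed twice with SHA-256, whereas this example hashes a JSON dictionary once. The field names (`prev_hash`, `tx_digest`, `nonce`) and the sample transactions are assumptions made for illustration.

```python
import hashlib
import json


def block_hash(header: dict) -> str:
    """Hash a header dictionary (JSON encoding is an illustrative stand-in)."""
    encoded = json.dumps(header, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def new_header(prev_hash: str, transactions: list, nonce: int = 0) -> dict:
    """Build a header that commits to the previous block and to the transactions."""
    tx_digest = hashlib.sha256(json.dumps(transactions).encode("utf-8")).hexdigest()
    return {"prev_hash": prev_hash, "tx_digest": tx_digest, "nonce": nonce}


# The first block has no predecessor, so it points at an all-zero hash.
genesis = new_header("0" * 64, ["coinbase -> miner A"])
block_1 = new_header(block_hash(genesis), ["A -> B: 5"])
block_2 = new_header(block_hash(block_1), ["B -> C: 2"])
```

Because each header contains the hash of the one before it, the blocks form a chain in which every block implicitly commits to the entire history that precedes it.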
The Bitcoin blockchain uses the SHA-256 hash function to generate a digital fingerprint for the contents of each block. A specified target determines the difficulty level for the mining calculations. The target concerns the first digits of the hexadecimal hash (i.e., digits 0-9 and letters a-f, or a total of sixteen possible characters). An example of a target is a hash output that starts with four zeroes. The goal becomes to generate a hash that is “lower than” the target. For example, consider the hash
The first four digits of the hash are four zeroes. This hash is lower than
The lower the target, the higher the difficulty level: a hash that begins with a longer string of zeroes is less likely to occur, so miners need more computing power to run through more and more iterations before they find an input whose hash falls below the target.
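To make the comparison concrete, a hexadecimal hash can be read as one large integer and compared with the target numerically. The sketch below uses made-up 64-digit values, not real block hashes, and the function names are hypothetical.

```python
def meets_target(hash_hex: str, target_hex: str) -> bool:
    """A hash qualifies when, read as an integer, it is lower than the target."""
    return int(hash_hex, 16) < int(target_hex, 16)


def chance_of_leading_zeros(n: int) -> float:
    """Probability that a uniformly random hex string starts with n zeroes."""
    return 16.0 ** -n


# Made-up 64-digit values for illustration only.
target = "0000" + "f" * 60
low_hash = "0000" + "3" * 60   # starts with four zeroes
high_hash = "0003" + "0" * 60  # only three leading zeroes

low_ok = meets_target(low_hash, target)    # True
high_ok = meets_target(high_hash, target)  # False
odds = chance_of_leading_zeros(4)          # 1 in 65,536
```

Each additional required zero digit divides the odds of success per attempt by sixteen, which is why a lower target translates directly into more hashing work.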
In mining, the block header data is iteratively hashed with a parameter called a nonce. The nonce is a number that is added to the header data and incremented on each attempt, varying the cryptographic hash output. The nonce is adjusted until the resulting hash meets the current requirements of the network’s specified target. Once a hash that meets the target is found, the block is considered validated and is shared across the network. As the new block is propagated, every node that receives it independently validates the hash according to a set of rules, which prevents miners that cheat from being rewarded and invalid blocks from being added to the main blockchain. More in-depth technical examples on block hashing and validation can be found in the Bitcoin Wiki and in the chapter on mining in Andreas Antonopoulos’ book, Mastering Bitcoin.
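The nonce search can be sketched as a short loop. This is a toy version under stated assumptions: it hashes once with SHA-256 and appends the nonce as text, whereas Bitcoin hashes a binary header twice and stores the nonce in a 32-bit header field. The difficulty here is expressed as a number of leading zero digits, and the function names and header bytes are hypothetical.

```python
import hashlib


def hash_with_nonce(header_data: bytes, nonce: int) -> str:
    """Hash the header data together with a candidate nonce."""
    return hashlib.sha256(header_data + str(nonce).encode("ascii")).hexdigest()


def mine(header_data: bytes, zero_digits: int, max_nonce: int = 1_000_000):
    """Increment the nonce until the hash starts with the required zeroes."""
    prefix = "0" * zero_digits
    for nonce in range(max_nonce):
        digest = hash_with_nonce(header_data, nonce)
        if digest.startswith(prefix):
            return nonce, digest
    raise RuntimeError("no nonce found within max_nonce attempts")


def verify(header_data: bytes, nonce: int, zero_digits: int) -> bool:
    """Any node can check the claimed nonce with a single hash."""
    return hash_with_nonce(header_data, nonce).startswith("0" * zero_digits)


header = b"example block header"  # placeholder bytes, not a real header
nonce, digest = mine(header, zero_digits=3)
is_valid = verify(header, nonce, zero_digits=3)
```

The asymmetry is the point of the design: finding the nonce takes thousands of attempts on average even at this toy difficulty, while checking it takes one hash.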
Blocks further back in the main blockchain are considered more secure. This is because as the length of the blockchain increases, each new block’s validation adds another layer of hash computations on top of the existing blocks, reinforcing the chain. Someone who attempts going back down the blockchain to change past transactions will have to rehash all the blocks above and expend tremendous computing power to create a fraudulent ledger with hashes that will not match the majority of the network’s distributed record.
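A toy chain shows why editing an old block is so disruptive. In this sketch the block layout, helper names, and sample records are hypothetical, and the mining step is omitted so that only the linking effect is visible.

```python
import hashlib


def digest(prev_hash: str, data: str) -> str:
    """Hash a block's data together with the previous block's hash."""
    return hashlib.sha256((prev_hash + data).encode("utf-8")).hexdigest()


def build_chain(records):
    """Link each record to the hash of the block before it."""
    chain, prev = [], "0" * 64
    for data in records:
        current = digest(prev, data)
        chain.append({"prev_hash": prev, "data": data, "hash": current})
        prev = current
    return chain


def first_broken_link(chain):
    """Return the index of the first inconsistent block, or None if intact."""
    prev = "0" * 64
    for i, block in enumerate(chain):
        if block["prev_hash"] != prev or digest(prev, block["data"]) != block["hash"]:
            return i
        prev = block["hash"]
    return None


chain = build_chain(["A pays B 5", "B pays C 2", "C pays D 1"])
intact = first_broken_link(chain)  # None: every link checks out

chain[0]["data"] = "A pays B 500"     # tamper with the oldest block
broken_at = first_broken_link(chain)  # 0: stored hash no longer matches

# Rehashing only the tampered block pushes the break to the next block.
chain[0]["hash"] = digest(chain[0]["prev_hash"], chain[0]["data"])
after_rehash = first_broken_link(chain)  # 1: block 1 still points at the old hash
```

In a real PoW chain each of those rehashes would also require redoing the mining work for that block, which is what makes rewriting deep history so expensive.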
Occasionally, forks will occur in the blockchain if two different miners successfully validate and broadcast a new block at approximately the same time. Different nodes will then have different versions of what is considered the main blockchain. A competing chain could also be a sign of an attempt at double-spending, or worse, an attempt to take control of the chain. As Antonopoulos explains:
Bitcoin’s block interval of 10 minutes is a design compromise between fast confirmation times (settlement of transactions) and the probability of a fork. A faster block time would make transactions clear faster but lead to more frequent blockchain forks, whereas a slower block time would decrease the number of forks but make settlement slower. (Mastering Bitcoin, Chapter 8)
When a fork occurs and miners become aware of a competing chain, they are expected to choose the chain that represents more Proof-of-Work, that is, the one that is longer. This permits a re-convergence of the forked chain.
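The selection rule can be sketched by summing the work behind each competing chain. The per-block work estimate used here, 2^256 divided by (target + 1), is the one Bitcoin's reference client uses; representing a chain as a plain list of targets, the function names, the sample targets, and the tie-break in favor of the first chain are assumptions made for this example.

```python
def block_work(target: int) -> int:
    """Expected number of hashes needed to find a block at this target."""
    return (1 << 256) // (target + 1)


def chain_work(targets) -> int:
    """Total work represented by a chain, given each block's target."""
    return sum(block_work(t) for t in targets)


def choose_chain(chain_a, chain_b):
    """Prefer the chain with more cumulative work (ties keep chain_a)."""
    return chain_a if chain_work(chain_a) >= chain_work(chain_b) else chain_b


easy = 1 << 240  # illustrative targets, not real network values
hard = 1 << 238  # a lower target means roughly four times the work per block

# At equal difficulty, the longer chain represents more work.
same_difficulty = choose_chain([easy] * 3, [easy] * 4)

# A shorter chain can still win if its blocks were harder to mine.
mixed_difficulty = choose_chain([hard] * 2, [easy] * 4)
```

When all blocks share the same difficulty, "more Proof-of-Work" and "longer" coincide, which is the case the paragraph above describes.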
The choice of the longer chain as the legitimate record highlights another design element in the PoW model implemented by Nakamoto. Nakamoto writes that the blockchain will remain “secure as long as honest nodes collectively control more CPU power than any cooperating group of attacker nodes.” In other words, the security of the distributed ledger depends on the inability of any single group to be in control of the majority of computing power in the system.
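Nakamoto’s paper includes a short calculation of how likely an attacker is to catch up from a given number of blocks behind. The sketch below is a Python port of that calculation, where q is the attacker’s share of computing power and z is the number of blocks the attacker must overtake. The function and variable names are this example’s own, and the model carries the paper’s simplifying assumptions.

```python
import math


def attacker_success_probability(q: float, z: int) -> float:
    """Probability that an attacker with hash share q overtakes a z-block lead."""
    if q >= 0.5:
        return 1.0  # with a majority of the power, the attacker always catches up
    p = 1.0 - q          # honest share of computing power
    lam = z * (q / p)    # expected attacker progress while z honest blocks are found
    total = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        total -= poisson * (1 - (q / p) ** (z - k))
    return total


one_block = attacker_success_probability(0.1, 1)
five_blocks = attacker_success_probability(0.1, 5)
majority = attacker_success_probability(0.6, 5)
```

For an attacker with a minority share, the probability falls off steeply as more blocks are added on top of a transaction; once the attacker holds a majority, the lead no longer offers any protection.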
This high-level summary of how a blockchain works has been focused primarily on Bitcoin, partly because the Bitcoin blockchain is open source and highly documented, but also because Bitcoin is the most widely known real-world experiment with Distributed Ledger Technology. Over the last few years, however, the blockchain concept has fired up the imaginations of a lot of developers and entrepreneurs, and a lot of newer blockchains out there borrow from the Bitcoin model. Ethereum is among the most famous examples of this, expanding the idea of the blockchain from currency to smart contracts.
Blockchains and Data Science
As Anders Brownworth points out in his interactive demo, the linked structure of the blockchain makes it possible to trace the provenance (origin and changes in ownership) of any digital asset. Provenance can provide key evidence in support of the authenticity of an object, asset, or record. If blockchain technology takes off, this could potentially result in large amounts of highly structured, anonymized, and authenticated data assets with transparent provenance. Data scientists with access to blockchain data could in theory build models and make predictions with cleaner, more reliable historical data. Modeling blockchain data against other political, social, and economic trends could also open up new possibilities for research.
Nevertheless, blockchain technology is at a very early and experimental stage and has raised a lot of controversy surrounding questions of regulation, security, privacy, stability, and scalability. There is plenty of uncertainty and room for speculation as to its practicality and future. At this point, little can be said for certain about the blockchain’s impact or about how it will affect the parallel development of data science and machine learning.
While we don’t know what blockchains will do for data science, acknowledging this still leaves room to consider what data science might do for blockchain as attempts are made to bring this technology to maturity. Even a high-level overview of the Bitcoin blockchain reveals a complex transaction system developed through consideration of various probabilities and statistics involved in economics and human behavior. As new blockchain infrastructures are implemented, data science can help guide their design and development as well as assess the impact of blockchain on business processes.
There are already a lot of questions about the blockchain where the tools and techniques of data science could provide a means to answers. Is the PoW model the most efficient and secure way of operating a blockchain? How can the costs of real-time value transfers be minimized? Is it possible to predict the probability of a fork? What transaction patterns could help in detecting fraud in real-time? More generally, what metrics could be developed for the purposes of enhancing security? Such questions are ambitious, but worth asking. Cryptocurrencies may rise and fall, but blockchain is definitely an emergent technology to watch over the coming years.
Antonopoulos, Andreas M. (2014, December 20). Mastering Bitcoin: Unlocking Digital Crypto-currencies. Sebastopol, CA. O’Reilly Media.
Bitcoin wiki. Block hashing algorithm (article). Last modified 12 December 2015. Retrieved February 1, 2018. https://en.bitcoin.it/wiki/Block_hashing_algorithm
Bitcoin wiki. Majority Attack (article). Last modified 10 August 2017. Retrieved February 1, 2018. https://en.bitcoin.it/wiki/Majority_attack
Brownworth, Anders. Blockchain Demo (Parts I & II). Retrieved January 16, 2018, from https://anders.com/blockchain/
Low, Christopher (2017, May 3). Blockchains Could Be Every Data Scientist’s Dream. Retrieved January 20, 2018, from Dataconomy, http://dataconomy.com/2017/05/blockchains-data-scientist-dream/
Orcutt, Mike. (2017, October 16). How Blockchain Could Give Us a Smarter Energy Grid. Retrieved January 20, 2018 from MIT Technology Review, https://www.technologyreview.com/s/609077/how-blockchain-could-give-us-a-smarter-energy-grid/
Nakamoto, Satoshi (2008, October). Bitcoin: A Peer-to-Peer Electronic Cash System. Retrieved January 29, 2018, from https://bitcoin.org/en/bitcoin-paper
Stray, Kari. (2017, July 28). How Are New Bitcoins Created? A Brief Guide to Bitcoin Mining. Retrieved January 24, 2018, from Cointelegraph, https://cointelegraph.com/news/how-are-new-bitcoins-created-a-brief-guide-to-bitcoin-mining
Williams, Sean. (2018, January 18). The Basics of Blockchain Technology, Explained in Plain English. Retrieved January 30, 2018, from The Motley Fool, https://www.fool.com/investing/2018/01/10/the-basics-of-blockchain-technology-explained-in-p.aspx