Cryptography: Hash Functions
UCL Course COMP0143 Cryptocurrencies: Prerequisite Cryptography Part I
Hash Functions
Motivation
Passwords stored in server database should not be stored in plaintext
Possible Solutions:
-
Using encryption: have to care about the secret keys
-
Using one-way function: have to care about the possible collisions
e.g., given the one-way function \(f\), there could exist a password \(pw^\prime \neq pw\) but \(f(pw^\prime) = f(pw)\)
Definition
Hash Function is a one-way function
which maps arbitrary-sized input \(m\) into an output \(h\) with fixed-length \(l\)
Properties
-
Collision Resistant (CR)
\[Pr[\text{HashColl}_{\mathcal{A},{H^s}}(\lambda) = 1] \leq negl(\lambda)\]where \(H^s: \{0,1\}^{l’(\lambda)} \rightarrow \{0,1\}^{l(\lambda)}\) with \(s \leftarrow Gen(\lambda)\) and \(l’(\lambda) > l(\lambda)\)
Note: \(s\) is public and need not be random.
The \(\text{HashColl}_{\mathcal{A},{H^s}}(\lambda)\) Process:
-
\(s \xleftarrow{$} Gen(1^\lambda)\)
-
\((x,y) \leftarrow \mathcal{A}(1^\lambda,s)\)
-
return \(1\) when \(x \neq y \implies H^s(x) = H^s(y)\) otherwise returns \(0\)
An Easier Explanation:
It is hard to find different \(m_0\) and \(m_1\), such that \(H(m_0) = H(m_1)\)
-
-
Preimage Resistance (PR1)
For given \(x \xleftarrow{$} \{0,1\}^*, h := H^s(x)\),
it is infeasible for any PPT adversary \(\mathcal{A}\) who knows \(s\) and \(h\)
to find \(y \in \{0,1\}^*\) which meets \(h = H^s(y)\)
An Easier Explanation:
It is hard to find a \(m\) for a given \(h\), such that \(h = H(m)\)
-
Second Preimage Resistance (PR2)
For given \(x \xleftarrow{$} \{0,1\}^*, h := H^s(x)\),
it is infeasible for any PPT adversary \(\mathcal{A}\) who knows \(s\) and \(x\)
to find \(y \in \{0,1\}^*, y \neq x\) which meets \(h = H^s(y)\)
An Easier Explanation:
It is hard to find a different \(m_1\) for a given \(m_0\), such that \(H(m_0) = H(m_1)\)
Relationship between Properties
-
if \(H\) is CR, then \(H\) is also PR2
-
if \(H\) is PR2, then \(H\) is also PR1 (in practice)
Collisions
Since \(l’(\lambda) > l(\lambda)\), there must exist collisions
Using Brute Force Attacks
Takes exponential time complexity: \(O(2^l)\)
Using Birthday Attacks
Collision Probability
\[\frac{q\cdot (q-1)}{4N} \leq Pr[C_{q,N}] \leq \frac{q \cdot (q-1)}{2N}\]where
-
\(N\) is the number of all possible outputs of \(H\) (i.e., \(N = 2^l\))
-
\(q\) is the number of distinct inputs that are hashed
when \(q \approx \sqrt N\) with probability \(\geq \frac{1}{2} − negl(\lambda)\)
Takes reduced time complexity: \(O(2^\frac{l}{2})\) (still NOT against PR2 or PR1)
Application in Bitcoin: SHA2-256
Merkle-Damgård Function
TBD
Davies-Meyer Compression Function
TBD
Application in Ethereum: Keccak-256
Sponge Function
This is a special method for building cryptographic hash functions
that can process an arbitrary-sized input and produce an arbitrary-sized output.
Parameters
-
State Size \(b\): total size of the internal state (e.g., 1600 bits for Keccak-256),
-
Bitrate Size \(r\): the portion of the state that interacts with the input/output (e.g., 1088 bits for Keccak-256),
-
Capacity Size \(c = b - r\): the portion of the state that provides security (e.g., 512 bits for Keccak-256).
Note: the security level is determined by half the capacity \(\frac{c}{2}\).
Padding Scheme
Use multi-rate padding:
-
First, appends a \(1\) bit
-
Then, followed by many \(0\) bits as needed
-
Finally, ends a \(1\) bit
to ensure the padded message length is a multiple of the rate \(r\).
Absorbing Phase
-
Set the initial state to zero
-
Split the padded message into \(M_i\) blocks (each block length is \(r\))
-
Update the first \(r\) bits of the current state by XORing with the block \(M_i\)
\[state[\, 0: r-1\,] = state[\, 0: r-1\,] \,\oplus \, M_i\] -
Get the next state by applying Keccak-\(f[1600]\) permutation function to the updated state
Squeezing Phase
The hash output is taken from the first \(r\) bits of the state when all blocks are aborted
-
If the desired hash length is less than or equal to \(r\) , it is extracted directly.
-
If more output is needed (e.g., length \(l\)), take the first \(r\) bits of the state as \(z_0\),
and apply Keccak-\(f[1600]\) permutation function again, take the first \(l-r\) bits of the state,
if \(l-r > r\), continue the permutation to get \(z_i\), and finally concatenate them together to get the hash.
Keccak-\(f[1600]\) Permutation Function
State Representation
The state is a 3-dimensional array of bits arranged as \(5 \times 5 \times w\) (e.g., \(w = 64\) for Keccak-256).
Round
The number of rounds \(n\) depends on the permutation width \(w = 2^l\)
and is given by \(n = 12 + 2l\) (e.g., \(n = 24\) for Keccak-256).
Permutation Steps
The state can be visualized as a \(5 \times 5\) grid of lanes and each lane is \(w\) bits long.
Each \(5 \times 5\) grid is called slice which has 5 rows and 5 columns
For each round, the permutation will perform the following steps
-
\(\theta\) Step: computes the parity of each column and mixes this parity into the state (import in-slice diffusion)
-
\(\rho\) Step: applies a cyclic shift (rotation) to each lane to spread bits over the lanes (import inter-slice diffusion)
-
\(\pi\) Step: rearranges the positions of the lanes (complete diffusion)
-
\(\chi\) Step: combines bits using a non-linear function (import non-linearity)
-
\(\iota\) Step: adds a round constant to the state (break symmetry)
Note: please check the official specifications summary for more details.
Random Oracle Model
A hash function \(H\) is a public function and anyone can use it to compute \(H(m)\) when giving \(m\).
Cryptographic security proofs often assume in addition that \(H(m)\) is unpredictable and uniformly distributed.
These proofs view hash functions as a Random Oracle.
Definition
-
A random oracle \(R(\cdot)\) on input \(m\) outputs a \(l\)-bit random string \(s = R(m)\).
-
It is an infinite object and cannot be computed publicly by a PPT algorithm.
-
To obtain \(s = R(m)\), input \(m\) must be queried to the oracle \(R(\cdot)\).
-
Outputs of \(R(\cdot)\) are unpredictable and uniformly distributed.
Relationship with Hash Function
Modelling hash functions \(H\) as a random oracle (or \(H\) is in Random Oracle Model) needs
-
\(H(\cdot)\) should be public, and
-
Distributions of \(R(\cdot)\)’s and \(H(\cdot)\)’s outputs are computationally indistinguishable.
Merkle Trees
Definition
A Merkle tree is a tree data structure in which
-
every leaf node contains the cryptographic hash of a data block, and
-
every non-leaf node contains the cryptographic hash of its child nodes.
Given \(n\) inputs \((x_0, x_1, \dots, x_{n-1})\) with \(x_j \in \{0,1\}^*\), nodes \(a_{i,j}\) in the binary Merkle tree:
\[a_{i,j} = \begin{cases} \, H(x_j), & i = 0 \\ \, H(a_{i-1, 2j} \, || \, a_{i-1, 2j+1}), & 1 \leq i \leq k \end{cases}\]where
-
\(0 \leq j \leq \frac{n}{2^i} - 1\)
-
\(k = log_2n\) which is the Merkle tree height
-
nodes \(a_{i,j}\) when \(i = 0\) are called leaves
-
node \(a_{i,j}\) when \(i = k\) is called Merkle root
Operations
Tree Creating
The Merkle tree is created according to the definition
Note: if there is an odd number of leaf nodes, the last node is duplicated to form a pair.
Root Calculating
The Merkle root is calculated once the tree is created (\(i = k\))
Note: any change in any data block will alter that leaf hash and propagate changes along the path up to the root.
Proof Generating
Process
-
Identify the target leaf hash: \(H(x_j)\).
-
Collect sibling hashes at each level: \(k-1\) hashes of sibling nodes are required to reconstruct the path to the root.
-
Generate proof: the proof \(\pi\) consists of the target leaf hash and the list of sibling hashes at each level.
Note: the Merkle proof is efficient since only a small subset of the tree is needed to verify the inclusion.
Inclusion Verifying
The verification can be done by comparing the root from the reconstructed path according to the given proof.
Tree Updating
Updating (adding, removing, modifying) data blocks need to recompute the affected hashes.
Tree Pruning
Reduce storage requirements by
-
removing unnecessary parts of the tree, while
-
retaining the ability to verify data inclusion.
Application in Bitcoin: Binary Merkle Tree
TBD
Application in Ethereum: Merkle Patricia Trie
Ethereum needs to efficiently store, update, and check key-value pairs with variable-length keys.
Radix Trie
A radix trie is a space-optimized trie where each node with only one child is merged with its child.
This compression reduces the number of nodes and edges, saving memory and potentially improving lookup times.
Key-Value Node Structure
[i_0, i_1, ..., i_n, value]
where
-
i_0, i_1, ... i_n
is the key of the node (represented by the symbols of the alphabet) -
value
is the value of the node (stores the value)
Key-Value Node Example
Given a key-value pair <'dog', 4>
-
Turn
dog
into letters of the alphabet:d o g
-
Follow path to find the value:
root -> d -> o -> g -> 4
Merkle Radix Trie
Provide integrity using Keccak-256 Hash Function and Recursive Length Prefix (RLP) Serialization
\[\textit{path} := \text{Keccak-256}\Big(\text{RLP}(\textit{value})\Big)\]Modifications
-
The atomic unit becomes a single hex character (called nibble)
-
Each node can maximally refer to 16 children and have a
value
element -
Node structure becomes a 17 element array
[i_0, i_1, ..., i_15, value]
Note: this 17 element array is the Branch
node type in Ethereum.
Recursive Length Prefix Serialization
Merkle Patricia Trie
Patricia is short for Practical Algorithm To Retrieve Information Coded in Alphanumeric.
Inefficiency of Merkle Radix Trie
Since each hex character in path
corresponds to a node in the trie,
and path
in Ethereum is usually 64 hex characters long (32 bytes).
-
Space Inefficiency: need over a kilobyte of extra space to store one level per character
-
Time Inefficiency: need to take the full 64 steps for each lookup or delete
Optimization via Node Types
-
NULL: represented as the empty string
-
Branch: 17 element array
[ v0, v1, ..., v15, vt ]
-
Leaf: 2 element array
[ encodedPath_L, value ]
-
Extension: 2 element array
[ encodedPath_E, key ]
When reaching a node where no divergent path exists for at least part of the way down,
the default method is to create up to 15 NULL nodes along the path.
To avoid this, shortcut the descent by setting up an Extension node of the form [ encodedPath_E, key ]
,
-
encodedPath_E
contains the partial path to skip ahead via a compact encoding -
key
is used for the next lookup
Then, a Leaf node can be marked by a flag in the first hex character of the encodedPath_L
,
the path encodes all prior node’s path fragments such that can look up the value directly.
Ambiguity Solution
When traversing paths in hex characters, may end up with an odd number of hex characters to traverse.
However, all data are stored in bytes format, and this will introduce ambiguity.
For example, one hex character 1
and two hex characters 01
are both stored as 01
.
Therefore, the partial path is prefixed with a flag to specify odd length.
def compact_encode(hexarray):
# Determine if there is a termination flag (Leaf nodes will have a termination flag)
term = 1 if hexarray[-1] == 16 else 0
if term:
hexarray = hexarray[:-1] # Remove the last element if it's 16
# Check if the length of the hex array is odd
oddlen = len(hexarray) % 2
# Set the flags: 2 * term + oddlen
flags = 2 * term + oddlen
# Adjust the array based on whether its length is odd
if oddlen:
hexarray = [flags] + hexarray
else:
hexarray = [flags, 0] + hexarray
# Ensure that the hexarray has an even length and the first hex character is the flags
output = bytearray()
# Combine two hex characters into a byte
for i in range(0, len(hexarray), 2):
output.append(16 * hexarray[i] + hexarray[i + 1])
# Convert each byte to its hexadecimal representation without '0x' prefix
output = [format(byte, '02x') for byte in output]
return output
To simplify:
-
Add
0x00
as prefix if the node is an Extension node and the path consists of even number of hex characters.[ 0x0, 0x1, 0x2, 0x3, 0x4, 0x5 ] => [ '00', '01', '23', '45' ]
-
Add
0x1
as prefix if the node is an Extension node and the path consists of odd number of hex characters.[ 0x1, 0x2, 0x3, 0x4, 0x5 ] => [ '11', '23', '45' ]
-
Add
0x20
as prefix if the node is a Leaf node and the path consists of even number of hex characters.[ 0x0, 0xf, 0x1, 0xc, 0xb, 0x8, 0x10 ] => [ '20', '0f', '1c' ,'b8' ]
-
Add
0x3
as prefix if the node is a Leaf node and the path consists of odd number of hex characters.[ 0xf, 0x1, 0xc, 0xb, 0x8, 0x10 ] => [ '3f', '1c', 'b8' ]
Comprehensive Example
Input
<'do'> = <64 6f> : 'verb'
<'dog'> = <64 6f 67> : 'puppy'
<'doge'> = <64 6f 67 65> : 'coins'
<'horse'> = <68 6f 72 73 65> : 'stallion'
Output
rootHash: [ <16>, hashA ]
hashA: [ <>, <>, <>, <>, hashB, <>, <>, <>, [ <20 6f 72 73 65>, 'stallion' ], <>, <>, <>, <>, <>, <>, <>, <> ]
hashB: [ <00 6f>, hashC ]
hashC: [ <>, <>, <>, <>, <>, <>, hashD, <>, <>, <>, <>, <>, <>, <>, <>, <>, 'verb' ]
hashD: [ <17>, hashE ]
hashE: [ <>, <>, <>, <>, <>, <>, [ <35>, 'coins' ], <>, <>, <>, <>, <>, <>, <>, <>, <>, 'puppy' ]
Explanation
-
rootHash
is an Extension node, where itsencodedPath
is[ 0x6 ] => [ '16' ]
. -
hashA
is a Branch node, where its child at Index4
points tohashB
and child at Index8
points to a Leaf node.Note: this Branch node does not have a value.
-
[ <20 6f 72 73 65>, 'stallion' ]
is a Leaf node which is pointed by the child at Index8
inhashA
,where its
encodedPath
is[ 0x6, 0xf, 0x7, 0x2, 0x7, 0x3, 0x6, 0x5, 0x10 ] => [ '20', '6f', '72', '73', '65' ]
. -
hashB
is an Extension node, where itsencodedPath
is[ 0x6, 0xf ] => [ '00', '6f' ]
. -
hashC
is a Branch node, where its child at Index6
points tohashD
.Note: this Branch node has a value.
-
hashD
is an Extension node, where itsencodedPath
is[ 0x7 ] => [ '17' ]
. -
hashE
is a Branch node, where its child at Index6
points to a Leaf node.Note: this Branch node has a value.
-
[ <35>, 'coins' ]
is a Leaf node which is pointed by the child at Index6
inhashE
,where its
encodedPath
is[ 0x5, 0x10 ] => [ '35' ]
.
Moreover, when one node is referenced inside another node, what is included is
\[H := \begin{cases} \, \text{Keccak-256}(x) & \text{if } \text{len}(x) \geq 32 \\ \, x & \text{if } \text{len}(x) < 32 \end{cases} \quad \text{where} \quad x := \text{RLP}(\text{node})\]Trie Types
-
State Trie: stores mappings between account addresses and account states.
path := keccak256(ethereumAddress)
andvalue := rlp(ethereumAccount)
where account state is a 4 element array
[nonce, balance, storageRoot, codeHash]
-
Storage Trie: stores extra mappings between account addresses and all related contract data.
-
Transaction Trie: stores transactions included in a block.
-
Receipts Trie: stores receipts of transactions (logs and status).