Substitution Cipher & OTP Attack
Back to Projects
Cybersecuritycryptographyotpsecurity

Substitution Cipher & OTP Attack

Advanced substitution cipher implementation and analysis of One-Time Pad integrity vulnerabilities.

Date
2024
Category
Cybersecurity

Overview

Lab 2 explores more sophisticated classical cryptography with substitution ciphers and modern OTP systems. The project implements general substitution cipher encryption/decryption and demonstrates practical attacks on One-Time Pad implementations when integrity is not properly maintained. Includes cryptanalysis techniques and demonstrates why authentication is crucial in cryptographic systems.

Technologies Used

PythonCryptographyXOR OperationsCryptanalysisNLTKStatistical Analysis

Key Features

General substitution cipher with custom alphabets

Automatic frequency analysis for cipher breaking

One-Time Pad encryption implementation

OTP integrity attack demonstrations

Bit-flipping attack on OTP ciphertexts

Statistical analysis tools for cryptanalysis

Showing 6 of 8 features

Implementation Details

1

What is a Substitution Cipher?

A substitution cipher is a method of encryption where each letter in the plaintext is replaced with another letter according to a fixed mapping. Unlike the shift cipher which uses a simple offset, substitution ciphers use a complete one-to-one mapping of the alphabet.

For example, the mapping might be:

With this mapping, "HELLO" would become "ITSSG". The key space is enormous - there are 26! (approximately 4 × 10²⁶) possible mappings, making brute force attacks impractical. However, substitution ciphers are vulnerable to frequency analysis because the statistical properties of the language are preserved in the ciphertext.

2

What is Frequency Analysis?

Frequency analysis is a cryptanalysis technique that exploits the fact that certain letters appear more frequently than others in natural language. In English, 'E' is the most common letter (12.7%), followed by 'T' (9.1%), 'A' (8.2%), and so on.

The attack works by:

  • Counting the frequency of each letter in the ciphertext
  • Comparing these frequencies to known English letter frequencies
  • Creating an initial mapping by matching high-frequency ciphertext letters to high-frequency plaintext letters
  • Iteratively improving the mapping by testing letter swaps and validating against an English dictionary
  • Scoring each attempt based on how many valid English words are produced

This technique is remarkably effective because the statistical properties of language are robust - even with a random substitution, the frequency distribution remains recognizable.

3

Loading and Analyzing Text

The first step in breaking a substitution cipher is to load the encrypted text and analyze its letter frequencies:

Code Explanation

The code provides three key functions: 1. **load_text()**: Reads the encrypted file and converts it to uppercase for consistent analysis. Error handling ensures the file exists and is readable. 2. **get_letter_frequency()**: Uses regex to extract only letters (ignoring punctuation and spaces) and counts their occurrences using Python's Counter class. This gives us the frequency distribution of the ciphertext. 3. **get_english_frequency()**: Returns a dictionary of known English letter frequencies based on linguistic analysis. These percentages represent how often each letter appears in typical English text. 'E' at 12.70% is the most common, while 'Z' at 0.07% is rare.

4

Creating Initial Mapping

Once we have the frequency data, we create an initial mapping by matching letters with similar frequencies:

Code Explanation

**create_initial_mapping()** implements the core frequency matching algorithm: 1. Both frequency distributions are sorted in descending order (most common first) 2. The most frequent ciphertext letter is mapped to 'E' (most common in English) 3. The second most frequent is mapped to 'T', and so on 4. This creates a reasonable initial guess based purely on statistics **decrypt_text()** applies the mapping to transform ciphertext: - Iterates through each character - Uppercase letters are substituted according to the mapping - Non-letter characters (spaces, punctuation) are preserved unchanged - This maintains the structure and readability of the original text

5

Optimization with Hill Climbing

The initial frequency-based mapping is rarely perfect. We use a hill-climbing algorithm to iteratively improve it by testing letter swaps and validating against an English dictionary:

Code Explanation

This optimization process uses three key components: 1. **get_english_words()**: Loads a comprehensive English dictionary using NLTK (Natural Language Toolkit). This provides ~200,000 words for validation. 2. **score_decryption()**: Calculates the quality of a decryption by: - Extracting all words from the decrypted text - Checking how many exist in the English dictionary - Returning the ratio (0.0 to 1.0, where 1.0 means all words are valid) 3. **find_best_mapping()**: Implements a hill-climbing optimization: - Starts with the frequency-based initial mapping - For each iteration, tries swapping every possible pair of letters (325 combinations) - If a swap improves the score, it's kept permanently - Continues until no swap improves the score (local maximum reached) - Returns both the best mapping and its score This approach typically converges quickly because valid English text has a much higher score than gibberish, creating a strong gradient toward the correct solution.

6

What is OTP (One-Time Pad)?

The One-Time Pad (OTP) is a theoretically unbreakable encryption method when used correctly. It was invented in 1917 and remains the only encryption system proven to be mathematically unbreakable. The security guarantee relies on three critical requirements:

  • The key must be truly random (not pseudo-random)
  • The key must be at least as long as the message
  • The key must never be reused for any other message

OTP uses XOR (exclusive OR) operation for both encryption and decryption. The beauty of XOR is that it is self-inverse: applying XOR twice with the same key returns the original value. This makes the encryption and decryption processes identical:

When implemented perfectly, OTP provides perfect secrecy - the ciphertext reveals absolutely no information about the plaintext (except its length). However, this security guarantee ONLY applies to confidentiality, not integrity. Without authentication, an attacker can modify the ciphertext to produce predictable changes in the plaintext.

7

How OTP Works

Let's implement OTP encryption and decryption in Python. The XOR operation works at the byte level, making it suitable for any binary data:

Code Explanation

The implementation consists of three key functions: 1. **XOR(a, b)**: The core operation that performs bitwise XOR on two byte sequences: - Uses `zip()` to iterate through pairs of bytes simultaneously - The `^` operator performs XOR at the bit level - Converts each result byte back using `.to_bytes(1, "big")` - Returns the result as a byte string 2. **encrypt_otp()**: Encryption is simply XOR of plaintext and key: - Validates that the key is long enough (critical security requirement) - Calls XOR with the plaintext and appropriate key slice - Returns the ciphertext 3. **decrypt_otp()**: Decryption is identical to encryption: - Also performs XOR operation (since XOR is self-inverse) - Uses the same key that was used for encryption - Returns the original plaintext The example demonstrates that encryption and decryption are symmetric operations. The hex representation of ciphertext shows it appears random, providing confidentiality. However, note that without integrity protection, this ciphertext can be manipulated.

8

Compromising OTP Integrity

While OTP provides perfect confidentiality, it provides NO integrity protection. This means an attacker who intercepts the ciphertext can make predictable modifications to the plaintext without knowing the key. This is called a bit-flipping attack or malleability attack.

The attack exploits the mathematical properties of XOR. If we know the original plaintext (or part of it) and want to change it to different text, we can compute the difference and apply it to the ciphertext:

Here's a practical implementation of this attack. Imagine a scenario where we know a message says "Student ID 1005510 gets 3 points" and we want to change it to "Student ID 100XXXX gets 4 points" without knowing the encryption key:

Code Explanation

This attack demonstrates a critical weakness in OTP without authentication: **The Attack Process:** 1. **Known plaintext**: We know (or can guess) the original message structure: "Student ID 1005510 gets 3 points" 2. **Desired plaintext**: We want to change it to: "Student ID 100XXXX gets 4 points" - Changed the student ID digits - Changed "3 points" to "4 points" 3. **Calculate difference**: ``` difference = original_plaintext ⊕ desired_plaintext ``` This tells us exactly which bits need to flip to transform the original into our desired text. 4. **Modify ciphertext**: ``` new_cipher = original_cipher ⊕ difference ``` By XORing the ciphertext with our difference, we flip exactly the bits needed. 5. **Result**: When the server decrypts `new_cipher` with their key, they get our `desired` plaintext. **Why This Works:** - We never needed to know the encryption key - We only needed to know (or guess) the original plaintext structure - The XOR operation allows us to flip specific bits predictably - The server has no way to detect the ciphertext was modified **The Lesson**: OTP provides confidentiality (perfect secrecy) but NOT integrity. Real-world systems must combine OTP with authentication mechanisms like HMAC to prevent tampering. This is why modern cryptography uses authenticated encryption schemes like AES-GCM that provide both confidentiality and integrity.

Challenges

Implementing efficient frequency analysis algorithms, understanding the mathematical foundations of OTP security, demonstrating practical attacks without breaking ethical boundaries.

Outcome & Impact

Successfully broke substitution ciphers using frequency analysis, demonstrated OTP vulnerabilities when integrity is compromised, highlighting the importance of authenticated encryption.

How I Grew

Deep understanding of substitution cipher mechanics and vulnerabilities

Learned why One-Time Pad requires both confidentiality and integrity

Mastered frequency analysis and statistical cryptanalysis

Understood the importance of message authentication codes

Gained insights into authenticated encryption schemes

Try It Yourself

Decoded output will appear here...

The algorithm uses frequency analysis and hill climbing optimization to find the most likely mapping. Longer texts with more common English words will produce better results. Click "Load example" to try with sample encrypted text!