
Optimizing Rules with Entropy

passwords hashcat hashcracking

This article was initially published on 1-23-2022 and republished on 12-23-2022.

Note: Password data mentioned in this article was obtained through public resources to improve overall password security posture. Information shared in public breaches helps improve security recommendations.


In my town, we had a game called the “Squid Game.” #

When cracking passwords, one of the most common attack types is the rule-based attack. In short, rules are instructions that modify plaintext values in predictable ways to generate many candidates quickly.

For example, the following Hashcat rules will take candidates and then append numbers to them:

# rules:
$1
$2
$1 $2 $3

# plaintext:
Password
Mypass

# result:
Password1
Password2
Password123
Mypass1
Mypass2
Mypass123

In practice, this works really well: passwords are often set in similar patterns, so you can take a single candidate and try multiple variations quickly. Using rules also pushes the mutation work onto the GPU, which is an efficient way to generate candidates on the fly.

There are many ways to make rules, such as raking, which uses random rule generation to create rulesets. Still, there are thousands of pre-written rules out there that cover many of the patterns password crackers have identified over time. In this sea of rules, there is much debate about which rules perform best and which ones should be used.

One project that optimized the released rulesets was OneRuleToRuleThemAll, which took the top 25% of rules from 14 well-known rulesets. This created a robust set of ~52,000 rules that aimed to reduce the noise and get more cracks in less time.

The resulting ruleset has proven to be effective, but a few lingering questions remain:

  • The rules were separated by their source, and then the top 25% best-performing rules were taken from each. While this makes sense, what happens if one source significantly outperforms another? Should the top 25% still be taken from both sources if one set cracked vastly more? Are better rules being left behind?
  • The testing process was repeated manually for each source. How would this process scale? What if we had hundreds of sources? What if the rulesets were vastly different sizes? How much does the multi-threading of the GPU affect the results?

To answer these questions, we can build a high-entropy environment where rules have to prove their value in a free-for-all against other rules. Several factors of the testing are shuffled, such as random hash samples, shuffled rulesets, and shuffled wordlists, so that each round is sufficiently random. This free-for-all process aims to introduce a scalable model for ranking rules directly against one another.

We can use the HaveIBeenPwned database as an extensive sample for our testing. To flesh out a proof of concept, we will take the top 425 million passwords (HIBPv7), sorted by popularity, for training. This is around ~69% of the total passwords, so well within one standard deviation of the total. The remaining passwords can then be used as testing data to limit polluting the results.
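
As a rough sketch, the split might look like the following. The filename and the exact line counts are assumptions; the HIBP NTLM ordered-by-count download ships as HASH:count pairs, so the count column is stripped first:

# minimal sketch, assuming the HIBP v7 NTLM ordered-by-count list (HASH:count per line)
cut -d: -f1 pwned-passwords-ntlm-ordered-by-count-v7.txt > hibp-ntlm-by-count.txt
head -n 425000000 hibp-ntlm-by-count.txt > top-425m-hibp.txt      # training pool
tail -n +425000001 hibp-ntlm-by-count.txt > bottom-185m-hibp.txt  # testing pool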

We will also want to add entropy to our rules and wordlists so that their order does not factor into the results, giving rules that generate the same plaintext an opportunity to score points. For resources, we can use the RockYou wordlist (in HIBP, but that’s okay) and 1.5 million rules from various public repositories.

The following chart describes the process we want to achieve:

graph TD
    A[RockYou.txt] --> Loop[Start Round]
    B[All-Rules.rule] --> Loop[Start Round]
    C[All-Hashes.left] --> Loop[Start Round]
    D[Start Round] -->|Shuffle Resources| E(Sample Wordlist & Rules)
    E --> F{Execute Hashcat}
    F -->|Capture Debug| G{Aggregate}
    G --> H{Parse Top Rules}
    subgraph Loop [Loop]
        D
        E
        F
    end

This whole process will be orchestrated by a bash script:

#!/bin/bash

# run a loop and shuffle a hashcat task to find the top-performing rules using rounds of randomized performance

for i in {1..100}
do
  # shuffle the ruleset in place so rule order never biases a round (sponge is from moreutils)
  cat rules.rule | shuf | sponge rules.rule
  # draw a fresh random 1M-word sample and 10M-hash sample each round
  cat rockyou.txt | shuf | head -n 1000000 > sample-1m-wordlist.txt
  cat top-425m-hibp.txt | shuf | head -n 10000000 > sample-10m-hibp.txt
  # debug-mode=1 logs the rule that produced each crack to the debug file
  hashcat -a 0 -m 1000 sample-10m-hibp.txt sample-1m-wordlist.txt --debug-mode=1 --debug-file=debug.txt --potfile-disable --quiet --runtime=2400 -r rules.rule -O
  # save this round's winning rules for later aggregation
  cat debug.txt > ./attempts/a-$i.txt
  rm debug.txt
done

# note: the uniq -c count column will still be in the file
cat ./attempts/*.txt | sort | uniq -c | sort -rn > rule-w-count.rule

The Results #

After 100 rounds, we have a battle-tested ruleset ready to be compared against the holdout set we saved from before. The testing set is taken from the bottom 185 million passwords of HaveIBeenPwned v7, and we can take a 1 million hash sample in the interest of time.
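
A holdout sample can be drawn with something like the following (bottom-185m-hibp.txt is the assumed name of the testing pool from the earlier split):

# draw a random 1 million hash sample from the holdout pool
shuf -n 1000000 bottom-185m-hibp.txt > hash-sample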

By taking the testing set from a different data pool, we can attempt to eliminate some of the “overfitting” caused by using the same data pool for both training and testing. To compare the rulesets equally, we want to ensure that the searched keyspace is equal, so we will take the top 51,998 rules from our sorted ruleset to match OneRuleToRuleThemAll.
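
One way to do that, sketched here with assumed filenames, is to strip the uniq -c counts from the aggregated file and keep the top 51,998 rules:

# strip the leading count column, then keep the top 51,998 rules
sed -E 's/^ *[0-9]+ //' rule-w-count.rule > squid-sorted.rule
head -n 51998 squid-sorted.rule > squid52k.rule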

To test, we will use the RockYou wordlist one final time in a straight attack mode:

hashcat -m 1000 -a 0 hash-sample rockyou.txt -r squid52k.rule

After some time, the results are in:

Rule                   Cracked            Cracks Not in Other   Remaining
OneRuleToRuleThemAll   389,158 (38.92%)   54,153 (13.92%)       610,842 (61.08%)
squid52k               432,183 (43.22%)   97,178 (22.49%)       567,817 (56.78%)

OneRuleToRuleThemAll Results: #


Squid Rule Results: #

These interesting results suggest that the method is viable. While this is only one test against one selection, it shows there is merit to the process and offers a scalable way to sort rules. While OneRuleToRuleThemAll is ~52k rules, this process sorted over 1 million rules to produce the squid rules.

It is also worth noting that using both rulesets would result in the most cracks (48.63%), and inspecting the differences between the two shows that the squid rules contain few rules from OneRuleToRuleThemAll.

# comparing the two rulesets with each other
# 22 shared
cat OneRuleToRuleThemAll.rule | anew -d squid52k.rule | wc -l
51976

# comparing the top 1 million rules of SquidRule
# 853 shared
cat OneRuleToRuleThemAll.rule | anew -d squid1m.rule | wc -l
51145
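
One way to reproduce the combined number is to merge the two files and rerun the same straight attack. This is only a sketch, and combined.rule is a hypothetical name:

# merge and deduplicate the two rulesets, then rerun the same attack
cat OneRuleToRuleThemAll.rule squid52k.rule | sort -u > combined.rule
hashcat -m 1000 -a 0 hash-sample rockyou.txt -r combined.rule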

This is great for a first test, but let’s go ahead and test again, using actual breach data and a more standardized scoring system. For this test, we will use this document created by PenguinKeeper to score our test rules against others.

To do so, we can split the rules into a few different sizes, as sketched after the list below:

  • squid500.rule
  • squid5k.rule
  • squid50k.rule
  • squid250k.rule
  • squid500k.rule
  • squid1m.rule
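
A minimal sketch of the splits, assuming squid-sorted.rule is the full count-stripped ruleset from the earlier step:

# cut the sorted ruleset into the tested sizes
head -n 500     squid-sorted.rule > squid500.rule
head -n 5000    squid-sorted.rule > squid5k.rule
head -n 50000   squid-sorted.rule > squid50k.rule
head -n 250000  squid-sorted.rule > squid250k.rule
head -n 500000  squid-sorted.rule > squid500k.rule
head -n 1000000 squid-sorted.rule > squid1m.rule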

Comparing Results: #

From the results, we can clearly see a few concerns. On the larger end of the spectrum, the crafted squid rules compete well with other large rulesets, suggesting that at a certain size, throwing in the kitchen sink works. However, the rules are less competitive on the smaller end of the spectrum. This test indicates that some overfitting happened in the previous test and that this strategy is better suited to finding a “middle ground” from large rule collections.


Raking Approach #

Raking could also be added to the process. This would add randomly generated rules between runs that would compete with the existing rules. It would theoretically combine the value of sorting existing rules and raking into a single process that can be left running unattended.

The following chart describes the new process:

graph TD
    A[RockYou.txt] --> Loop[Start Round]
    B[All-Rules.rule] --> Loop[Start Round]
    C[All-Hashes.left] --> Loop[Start Round]
    D[Start Round] -->|Shuffle Resources| E(Sample Wordlist & Rules)
    E --> F{Execute Hashcat}
    F -->|Capture Debug| G{Aggregate}
    G --> Rake[Rake]
    H{Execute Hashcat} --> I{Aggregate}
    I -->|Add Raked| B
    G --> Z{Parse Top Rules}
    subgraph Loop [Loop]
        D
        E
        F
        G
        subgraph Rake [Rake]
            H
            I
        end
    end

The modified bash script:

#!/bin/bash

# run a loop and shuffle a hashcat task to find the top-performing rules using rounds of randomized performance, raking new rules between rounds

for i in {1..100}
do
  # shuffle the ruleset in place so rule order never biases a round (sponge is from moreutils)
  cat rules.rule | shuf | sponge rules.rule
  # draw a fresh random 1M-word sample and 10M-hash sample each round
  cat rockyou.txt | shuf | head -n 1000000 > sample-1m-wordlist.txt
  cat top-425m-hibp.txt | shuf | head -n 10000000 > sample-10m-hibp.txt
  # debug-mode=1 logs the rule that produced each crack to the debug file
  hashcat -a 0 -m 1000 sample-10m-hibp.txt sample-1m-wordlist.txt --debug-mode=1 --debug-file=debug.txt --potfile-disable --quiet --runtime=2400 -r rules.rule -O
  cat debug.txt > ./attempts/a-$i.txt
  rm debug.txt
  # short raking pass: generate 5,000 random rules and capture any that score a crack
  hashcat -a 0 -m 1000 sample-10m-hibp.txt sample-1m-wordlist.txt --debug-mode=1 --debug-file=debug.txt --potfile-disable --quiet --runtime=300 -g 5000 --generate-rules-func-min=2 --generate-rules-func-max=6 -O
  # append the winning raked rules so they compete in future rounds
  cat debug.txt | anew -q rules.rule
  rm debug.txt
done

# note you still have to manually clean up the result file
cat ./attempts/*.txt | sort | uniq -c | sort -rn > rule-w-count.rule

Application #

This experiment showed the value of creating custom workflows and resources to improve hash-cracking methodologies. Applications for this technique are cases where you want to filter massive amounts of rules down into a select subset for other uses, such as running your own long-term raking operation or narrowing down rules to make your own rulesets. I would not rely on this method to create precise rulesets, but rather to filter out “bad” rules.
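
For example, one way to prune rules that never earned a crack across all rounds is sketched below; the filenames are assumed from the earlier scripts:

# list rules that never appeared in any round's debug output; these are candidates to drop
sed -E 's/^ *[0-9]+ //' rule-w-count.rule | sort -u > scored.rule
sort -u rules.rule > all.rule
comm -23 all.rule scored.rule > never-scored.rule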

The rulesets created and other items can be found here.