To crack hashes, practitioners use large word lists containing likely password candidates. These lists are combined with different attack types, such as rule-based attacks, to recover the plaintext values.
Not surprisingly, the best word lists come from actual passwords, because human habits in choosing passwords create predictable patterns that can be targeted. For example, people are far less likely to add numbers to the beginning of a password than to the end. Passwords starting with digits do occur, but they are statistically less common than passwords ending with them.
In this post, I will introduce my methodology for creating new password-cracking word lists and benchmark them against other popular ones.
Extract, Transform, and Load
I dumped all the cracked hashes on my password archive server to get started. We will work with ~958 million passwords for this test. To get the best results possible, I wanted to filter out any bad patterns before getting started.
# dump passwords
$ wc -l plain-passwords.lst
958104442 plain-passwords.lst
# remove low quality items
$ cat plain-passwords.lst | grep -vE 'http:\/\/|https:\/\/|\@\.com|\@\.ru|\@\.cn|\@\.org|\@.*\.net|<tr>|<div>|<a href|<p>|<img src|\$HEX\[|fbobh_|\@mail|\@msn|\@aol|\@yahoo|\@gmail|\@hotmail' | grep -v '[^[:print:]]' > prepped-passwords.lst
$ wc -l prepped-passwords.lst
924073743 prepped-passwords.lst
This dropped the total count to ~924 million, removing about 34 million lines (roughly 3.6%). That is quite a lot, but because we are making word lists, any quality filtering will go a long way.
What’s Behind The Mask?
One of my favorite strategies for creating word lists is to use common password masks to filter word lists for candidates. This way, we can avoid seemingly “random” passwords and keep the best-quality candidates together.
To do this, I used a tool called ptt which can turn plaintext passwords into hashcat masks with additional metadata such as complexity and length.
# make password masks
$ cat prepped-passwords.lst | ptt -t mask -v > mask-passwords.lst
$ wc -l mask-passwords.lst
920747753 mask-passwords.lst
# aggregate masks
$ cat mask-passwords.lst | sort -T ./ | uniq -c | sort -T ./ -rn > sorted-all-masks.txt
$ head -n 5 sorted-all-masks.txt
83399314 ?l?l?l?l?l?l?l?l:8:1
15711147 ?l?l?l?l?l?l?l?l?l?l:10:1
14774094 ?d?d?d?d?d?d?d?d:8:1
14189820 ?d?d?d?d?d?d?d?d?d?d:10:1
13469846 ?l?l?l?l?l?l?l?l?l:9:1
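For readers who want to see how a mask is derived without running ptt, the sketch below approximates the mask:length:complexity format shown in the output above. It is not ptt's implementation. The function name to_mask is hypothetical, complexity is assumed to mean the number of distinct character classes (which is consistent with the sample output), and every byte that is not an ASCII letter or digit is simply treated as ?s.

```shell
# Hypothetical helper: print "mask:length:complexity" for each line on stdin.
# Assumption: complexity = number of distinct character classes in the password.
to_mask() {
  LC_ALL=C awk '{
    mask = ""; l = 0; u = 0; d = 0; s = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c ~ /[a-z]/)      { mask = mask "?l"; l = 1 }   # lowercase
      else if (c ~ /[A-Z]/) { mask = mask "?u"; u = 1 }   # uppercase
      else if (c ~ /[0-9]/) { mask = mask "?d"; d = 1 }   # digit
      else                  { mask = mask "?s"; s = 1 }   # everything else
    }
    print mask ":" n ":" (l + u + d + s)
  }'
}
```

For example, `printf '%s\n' 'password' | to_mask` prints `?l?l?l?l?l?l?l?l:8:1`, the same shape as the most common mask in the list above.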
Doing some math on the results, if we took just the top 5,000 masks, we would cover around 86.8% of the plaintext passwords. This is exciting because we can gain additional speed and performance by leaning out the word list to the most likely candidates.
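The coverage figure can be reproduced from the aggregated mask file with a few lines of awk. This is a minimal sketch, assuming the input is `uniq -c` output sorted by count in descending order; the function name mask_coverage is hypothetical.

```shell
# Hypothetical helper: percentage of all passwords covered by the top N masks.
# $1 = N, $2 = file with "<count> <mask>" lines, sorted by count descending.
mask_coverage() {
  LC_ALL=C awk -v n="$1" '
    { total += $1 }            # every line adds to the grand total
    NR <= n { top += $1 }      # only the first N lines add to the covered total
    END { if (total > 0) printf "%.1f\n", 100 * top / total }
  ' "$2"
}
```

Running `mask_coverage 5000 sorted-all-masks.txt` would print the share of passwords covered by the top 5,000 masks, and changing the first argument makes it easy to compare different cut-offs.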
Next up, let us collect all the dumped passwords and any previously made word lists to ensure we have complete coverage.
# get all the word sources
$ wc -l wordlist1.lst
363572796 wordlist1.lst
$ wc -l wordlist2.lst
445966458 wordlist2.lst
$ wc -l dumped-passwords.lst
924073743 dumped-passwords.lst
# file sizes
$ ll | grep lst
3.8G -rwxrwxrwx 1 jw jw 3.8G Jul 11 21:01 wordlist1.lst
5.1G -rwxrwxrwx 1 jw jw 5.1G Jul 11 21:11 wordlist2.lst
9.7G -rwxrwxrwx 1 jw jw 9.7G Jul 5 21:46 dumped-passwords.lst
We need to get the top 5,000 password masks from sorted-all-masks.txt, which will cover most of the database’s passwords. We also have metadata from ptt that we can use to make even more specific word lists.
After removing the item count from sorted-all-masks.txt, we can use a regex to filter the masks, keeping only those with a length greater than or equal to eight (8) characters and a complexity of three (3) or four (4).
# top 5k masks overall
$ head top-5k-masks.txt
?l?l?l?l?l?l?l?l
?l?l?l?l?l?l?l?l?l?l
?d?d?d?d?d?d?d?d
?d?d?d?d?d?d?d?d?d?d
?l?l?l?l?l?l?l?l?l
?l?l?l?l?l?l?d?d
?l?l?l?l?l?l?l?l?d?d
?l?l?l?l?l?l?l
?l?l?l?l?l?l?d?d?d?d
?l?l?l?l?d?d?d?d
# masks from top 5k that meet above requirements
$ cat clean-sorted-3to4-complexity-mask-passwords.lst | grep -vE ':3:3$|:4:3$|:4:4$|:5:3$|:5:4$|:6:3$|:6:4$|:7:3$|:7:4$' > clean-sorted-3to4-complexity-ge8-len-mask-passwords.lst
$ head top-5k-3to4ge8-masks.txt
?u?l?l?l?l?l?d?d
?u?l?l?l?l?l?d?d?d?d
?u?l?l?l?l?d?d?d?d
?u?l?l?l?d?d?d?d
?u?l?l?l?l?l?l?d?d
?u?l?l?l?l?l?l?l?d?d
?s?l?l?l?d?d?d?d?d?d
?u?l?l?l?l?l?d?d?d
?u?l?l?l?l?d?d?d
?u?l?l?l?l?l?l?d
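The count removal and metadata filtering described above can also be done numerically instead of with a regex. The following is a rough sketch, not the exact commands used for this post: filter_masks is a hypothetical name, and it assumes each input line looks like `<count> <mask>:<length>:<complexity>`.

```shell
# Hypothetical helper: from the first N lines of the aggregated mask file,
# print the bare masks with length >= 8 and complexity 3 or 4.
# $1 = N, $2 = file with "<count> <mask>:<length>:<complexity>" lines.
filter_masks() {
  head -n "$1" "$2" | LC_ALL=C awk '{
    split($2, f, ":")               # f[1] = mask, f[2] = length, f[3] = complexity
    len = f[2] + 0; cx = f[3] + 0   # force numeric comparison
    if (len >= 8 && cx >= 3 && cx <= 4) print f[1]
  }'
}
```

A call such as `filter_masks 5000 sorted-all-masks.txt` would print only the masks, ready to be saved as a mask file. Comparing the fields as numbers avoids having to enumerate every excluded length and complexity pair in a pattern.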
Now we push the word lists through ptt and keep only the entries that match the most popular masks. This slims the word lists down to the most probable entries.
# making a wordlist
$ cat wordlist1.lst | ptt -t match -tf top-5k-masks.txt > top5kmaskswords.lst
$ cat wordlist2.lst | ptt -t match -tf top-5k-masks.txt > top5kmaskswords-2.lst
$ cat dumped-passwords.lst | ptt -t match -tf top-5k-masks.txt > top5kmaskswords-3.lst
# making a complex wordlist
$ cat wordlist1.lst | ptt -t match -tf top-5k-3to4ge8-masks.txt > top5kmaskswords-3to4ge8.lst
$ cat wordlist2.lst | ptt -t match -tf top-5k-3to4ge8-masks.txt > top5kmaskswords-3to4ge8-2.lst
$ cat dumped-passwords.lst | ptt -t match -tf top-5k-3to4ge8-masks.txt > top5kmaskswords-3to4ge8-3.lst
# sample of a list
$ head top5kmaskswords-3to4ge8-3.lst
$01april
$01august
$01August
$01autumn
$01Autumn
$01december
$01february
$01january
$01march
$01november
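Conceptually, the match step computes the mask of each word and keeps the word only if that mask appears in the mask file. The sketch below illustrates the idea in plain awk. It is an approximation and not how ptt works internally: match_masks is a hypothetical name, and non-alphanumeric bytes are all mapped to ?s.

```shell
# Hypothetical helper: print the words from stdin whose mask is listed in $1.
# $1 = file with one bare mask per line (for example top-5k-masks.txt).
match_masks() {
  LC_ALL=C awk '
    function mask_of(w,    i, c, m) {
      m = ""
      for (i = 1; i <= length(w); i++) {
        c = substr(w, i, 1)
        if (c ~ /[a-z]/) m = m "?l"
        else if (c ~ /[A-Z]/) m = m "?u"
        else if (c ~ /[0-9]/) m = m "?d"
        else m = m "?s"
      }
      return m
    }
    NR == FNR { wanted[$0] = 1; next }   # first input: load the mask list
    (mask_of($0) in wanted)              # second input: print matching words
  ' "$1" -
}
```

Usage would look like `cat wordlist1.lst | match_masks top-5k-masks.txt`. An entry such as `$01april` from the sample above has the mask `?s?d?d?l?l?l?l?l`, so it is kept only when that mask is in the list.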
After getting all the results, we can combine the files and remove duplicate values. When everything was finished, we had gone from ~16 GB (the size of everything, deduplicated) down to ~12 GB, which is around 75% of the original size.
# final sizes
$ wc -l top-5k-masks-3to4ge8.lst
155719008 top-5k-masks-3to4ge8.lst
$ wc -l top-5k-masks.lst
1187881713 top-5k-masks.lst
$ ll
1.7G -rwxrwxrwx 1 jw jw 1.7G Jul 13 22:06 top-5k-masks-3to4ge8.lst
12G -rwxrwxrwx 1 jw jw 12G Jul 13 23:19 top-5k-masks.lst
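The combine and deduplicate step is not shown above. One common way to do it is a single `sort -u` over all the partial lists, sketched here with a hypothetical merge_unique helper; the exact commands used for this post may have differed.

```shell
# Hypothetical helper: merge the word lists $2... into $1, dropping duplicate lines.
# LC_ALL=C makes sort compare raw bytes, which is faster and locale-independent.
merge_unique() {
  out=$1
  shift
  LC_ALL=C sort -u -o "$out" "$@"
}
```

For example, `merge_unique top-5k-masks.lst top5kmaskswords.lst top5kmaskswords-2.lst top5kmaskswords-3.lst`. For inputs of this size, adding `-T ./` as in the earlier sort commands keeps the temporary files on a disk with enough free space.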
Now there is a choice to make. The options are:
- Leave the word lists as they are, with duplicates between each other.
- Deduplicate word lists between each other.
Both have advantages. By leaving the word lists as they are, you can be sure that you have great coverage at the risk of running duplicates. If you opt to remove duplicates, you will remove the risk of running duplicates, but you may miss out on coverage unless you run both.
For this test, we will remove duplicates between the two with rli.bin. We also put all unmatched items into their own word list to preserve the data. We will take the top 5k masks list and remove from it every entry that also appears in the complexity-filtered top 5k masks list. This way, the smaller complexity list retains its size, and the larger list is reduced.
# syntax
$ rli.bin -h
usage: rli.bin infile outfile removefiles...
# if the files are too large try splitting
$ split -n l/2 file.lst
# start by removing entries from the top 5k masks passwords
$ rli.bin top-5k-masks.lst reduced-top-5k-masks.lst top-5k-masks-3to4ge8.lst
# then remove matched entries from the unmatched entries
$ rli.bin remainder.lst reduced-remainder.lst reduced-top-5k-masks.lst top-5k-masks-3to4ge8.lst
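If rli.bin is not available, the same kind of cross-list removal can be approximated with awk. This is a minimal sketch and not a replacement for the tool: remove_listed is a hypothetical name that mirrors the argument order from the usage line above, it assumes the input file is not also passed as a remove file, and it holds every line of the remove files in memory, so it only suits remove lists that fit in RAM.

```shell
# Hypothetical helper with the same argument order as the usage line above:
#   remove_listed infile outfile removefiles...
# Writes to outfile every line of infile that appears in none of the removefiles.
remove_listed() {
  src=$1
  dst=$2
  shift 2
  LC_ALL=C awk -v infile="$src" '
    FILENAME != infile { drop[$0] = 1; next }   # remove files: remember each line
    !($0 in drop)                               # infile: print lines not seen above
  ' "$@" "$src" > "$dst"
}
```

A call such as `remove_listed top-5k-masks.lst reduced-top-5k-masks.lst top-5k-masks-3to4ge8.lst` keeps the original line order of the input file, which matters if the list is frequency sorted.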
Let's check out the sizes of the final word lists:
$ ll
3.5G -rwxrwxrwx 1 jw jw 3.5G Jul 14 14:28 final-remainder.lst
1.7G -rwxrwxrwx 1 jw jw 1.7G Jul 13 22:06 final-top-5k-mask-3to4ge8-passwords.lst
11G -rwxrwxrwx 1 jw jw 11G Jul 14 13:19 final-top-5k-mask-passwords.lst
$ wc -l final*
218014370 final-remainder.lst
155719008 final-top-5k-mask-3to4ge8-passwords.lst
1057925799 final-top-5k-mask-passwords.lst
1431659177 total
We moved ~3.5 GB of unlikely candidates into their own word list and shaved off ~2.4 GB in duplicate values, leaving three newly optimized word lists.
The Results
To best measure effectiveness, we will split the wordlists into a few different sizes using the same methods above:
| Wordlist | Size | Line Count |
|---|---|---|
| top5kmasks.lst | 11GB | 1,030,877,000 |
| top15masks.lst | 2.9GB | 307,605,705 |
| top5masks.lst | 1.6GB | 175,981,061 |
| top5kmasks-c8.lst | 1.3GB | 123,148,699 |
| top15masks-c8.lst | 229MB | 23,745,089 |
| top5masks-c8.lst | 120MB | 12,849,424 |
| top5kmasks-c8l.lst | 1.3GB | 123,148,699 |
| top15masks-c8l.lst | 229MB | 23,745,089 |
| top5masks-c8l.lst | 120MB | 12,849,424 |
| top22masks-nd.lst | 3.4GB | 366,420,774 |
| leftovers.lst | 3.4GB | 210,183,312 |
To break down the differences:
- no-ending: no special filtering, just the top x masks
- c8: filtered for complexity and length greater than or equal to eight (8)
- c8l: same as c8 but everything in lowercase
- nd: same as no-ending but skipping masks that are 100% digits
- leftovers: list containing all the non-matched items
For testing, we will be using a rules and word list document created by PenguinKeeper and others to get a standard process and benchmark. A table with complete information will be provided at the bottom of the article.
The following summarizes the results. Lists marked with an asterisk (*) are the ones created in this post:
Wordlist | Cracked | Cracked % | Size (MB) | Keyspace
--------------------------------------------|-----------|-------------|-------------|-------------
top5kmasks.lst * | 3731088 | 71.765 | 10835 | 1030877000
rockyou2021 (news ref) | 2121189 | 40.799 | 98378 | 8459060239
weakpass_2a | 2023304 | 38.917 | 91742 | 7884602871
hashesorg2019 (weakpass) (Old) | 1989619 | 38.269 | 13733 | 1279729109
hashes.org-2012-2019 (Old) | 1985107 | 38.182 | 13639 | 1270725606
DicAssv1 | 1818202 | 34.972 | 216730 | 16141112024
weakpass_2 | 1787871 | 34.388 | 30542 | 2649982129
kaonashi | 1715915 | 33.004 | 9753 | 866508697
ALM(PasswdOnly)(freq_sorted) | 1692704 | 32.558 | 7732 | 640591900
foordeluxes | 1669987 | 32.121 | 9792 | 891071188
hibpv6 | 1597316 | 30.723 | 10364 | 892631604
hibpv5 | 1583437 | 30.456 | 10171 | 875298829
hibpv4 | 1577954 | 30.351 | 10131 | 871534311
hibpv3 | 1565746 | 30.116 | 9320 | 837438728
weakpass_1 | 1553174 | 29.874 | 37008 | 3130162774
hibpv2 | 1552743 | 29.866 | 9157 | 821876827
hashes.org-2019 | 1548285 | 29.780 | 5513 | 522172105
WHYPHY2 (Not public) | 1542764 | 29.674 | 2544 | 241084970
top22masks-nd.lst * | 1545697 | 29.730 | 3560 | 366420774
cyclone_hk | 1517933 | 29.196 | 2624 | 257823994
foordeluxestuff | 1445497 | 27.803 | 5373 | 482278969
Top2Billion-probable-v2 | 1431415 | 27.532 | 21745 | 1973218843
breachcompilation | 1420955 | 27.331 | 9641 | 1012022949
b0n3z-sorted-wordlist | 1416938 | 27.254 | 74512 | 7867573012
b0n3z | 1395994 | 26.851 | 34640 | 3113289498
hibpv1 | 1362149 | 26.200 | 3544 | 320294199
hashes.org-2018 | 1357062 | 26.102 | 6430 | 475531709
HashesOrg (weakpass) | 1275418 | 24.532 | 4457 | 446426190
DCHTPassv1.0 | 1274182 | 24.508 | 24524 | 3072260790
Md5decrypt-awesome-wordlist | 1207631 | 23.228 | 21083 | 1844826117
top15masks.lst * | 1181174 | 22.719 | 3020 | 307605705
Nummer_DB | 1176964 | 22.638 | 2416 | 202783735
only_latin | 1175667 | 22.613 | 2318 | 198098375
antipublic | 1148475 | 22.090 | 1919 | 189640017
unique_usernames | 1144898 | 22.021 | 16874 | 1246520259
Top353Million-probable-v2 | 1099481 | 21.148 | 3788 | 353330260
CoinWordlist | 1087705 | 20.921 | 1239 | 107661196
hashes.org-2015 | 1073228 | 20.643 | 3253 | 343103178
passw_from_logs | 1047760 | 20.153 | 3035 | 222339592
EvilGhost | 976476 | 18.782 | 100932 | 10579628569
elackops | 949442 | 18.262 | 1270 | 102548616
passcape_comp | 932728 | 17.940 | 8204 | 616095654
InsideProFull | 904516 | 17.398 | 1612 | 154045162
ASLM(freq_sorted) | 899640 | 17.304 | 503 | 41591035
ASLM(freq_sorted) cleaned | 897858 | 17.270 | 397 | 39096069
uniq | 896504 | 17.244 | 2662 | 243779397
Top109Million-probable-v2 | 890325 | 17.125 | 1142 | 109438614
passwords_collection | 889102 | 17.101 | 2639 | 241584732
HyperionOnHackForumsNetRELEASE | 889102 | 17.101 | 2639 | 241584732
crackstation | 888963 | 17.099 | 15696 | 1212336035
top5kmasks-c8l.lst * | 888222 | 17.084 | 1376 | 123148699
wordlist_by_Kakoluk | 883330 | 16.990 | 5069 | 445871442
MIX_logins-email-2016 | 867838 | 16.692 | 8432 | 623974701
hashes.org-2017 | 845231 | 16.257 | 3546 | 324025149
hashes.org-2016 | 781397 | 15.030 | 1177 | 102117059
18_in_1 | 773850 | 14.884 | 39099 | 5343785797
clem9669_wordlist_large | 772415 | 14.857 | 14082 | 1113453393
kac | 758212 | 14.584 | 1810 | 170422706
top5kmasks-c8.lst * | 746127 | 14.351 | 1376 | 123148699
Super_mega_dic | 732605 | 14.091 | 2891 | 212443106
MegaCracker | 714010 | 13.733 | 1710 | 148615152
kaonashi14M | 690846 | 13.288 | 138 | 14344391
ignis-10M | 652217 | 12.545 | 94 | 10000000
Top29Million-probable-v2 | 617523 | 11.878 | 299 | 29040646
hk_hlm_founds | 604118 | 11.620 | 408 | 38647791
rp4 | 577447 | 11.107 | 509 | 47688304
Wordlist_82_million | 571517 | 10.993 | 553 | 62619507
lolwtfhax | 571517 | 10.993 | 553 | 62619507
clem9669_wordlist_medium | 554971 | 10.674 | 3133 | 193661571
SmolDick | 528213 | 10.160 | 626 | 40163196
clem9669_wordlist_small | 527078 | 10.138 | 511 | 45054002
realhuman | 508497 | 9.781 | 716 | 63941069
MECA_Passlist | 508497 | 9.781 | 716 | 63941069
hashkiller-dict | 508287 | 9.777 | 253 | 23685601
eNtr0pY_ALL_sort_uniq | 508228 | 9.775 | 914 | 83653572
hashash.in | 489853 | 9.422 | 221 | 22777141
top15masks-c8l.lst * | 463822 | 8.921 | 240 | 23745089
mathway | 428797 | 8.248 | 167 | 16498019
14-million-pass - Screetsec | 420958 | 8.097 | 140 | 14344384
rockyou | 420944 | 8.097 | 140 | 14344359
hashkiller-dict | 407130 | 7.831 | 224 | 18439169
the_best | 398508 | 7.665 | 186 | 17532884
random_social_usernamesupd | 361512 | 6.953 | 1908 | 154463897
M3G_THI_CTH_WORDLIST_CLEANED | 358453 | 6.895 | 177 | 15738781
passwords | 355047 | 6.829 | 194 | 15851426
top5masks.lst * | 338345 | 6.508 | 1691 | 175981061
clem9669_wordlist_large | 331212 | 6.371 | 8375 | 765991502
clem9669_wordlist_medium | 327430 | 6.298 | 1175 | 82065889
dna | 318152 | 6.119 | 168 | 18216183
livejournal (new ref) | 312965 | 6.020 | 216 | 20266972
clem9669_wordlist_small | 306172 | 5.889 | 147 | 13953734
000webhost | 295661 | 5.687 | 132 | 10620225
ignis-1M | 288167 | 5.543 | 9 | 1000000
leftovers.lst * | 279328 | 5.373 | 3607 | 210183312
top5masks-c8l.lst * | 277234 | 5.332 | 126 | 12849424
SkullSecurityComp | 275411 | 5.297 | 72 | 6693327
mega_slovar | 260894 | 5.018 | 336 | 31630758
Hashkiller.com-Nilix_Collection | 235919 | 4.538 | 235 | 22738835
collect_from_logs | 212141 | 4.080 | 192 | 12560275
top15masks-c8.lst * | 201255 | 3.871 | 240 | 23745089
dazzlepod | 151906 | 2.922 | 20 | 2151235
xsplit | 147851 | 2.844 | 9 | 939014
xato-net-10-million-passwords-1000000 | 138406 | 2.662 | 9 | 1000000
10_million_password_list_top_1000000 | 137453 | 2.644 | 9 | 1000000
xato-net-10-million-usernames | 127314 | 2.449 | 85 | 8295455
Hashkiller.com-Wordlists_compilation | 113430 | 2.182 | 107 | 10241373
top5masks-c8.lst * | 109147 | 2.099 | 126 | 12849424
opencrack_plains_2009 | 94288 | 1.814 | 31 | 3046096
vb_passwords | 88290 | 1.698 | 7 | 750449
Argon_Wordlist_v2 | 87312 | 1.679 | 2011 | 227784242
under1000k | 83506 | 1.606 | 11 | 902748
Hashkiller.com-Silver_small_Wordlist | 82705 | 1.591 | 30 | 3256289
Top304Thousand-probable-v2 | 80643 | 1.551 | 3 | 303872
vkontakte | 78740 | 1.515 | 7 | 697404
Hashkiller.com-Common_passes | 75835 | 1.459 | 15 | 1507155
ignis-100K | 66511 | 1.279 | 1 | 100000
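Raw crack counts favor large lists, so it can also help to look at cracks relative to keyspace. The sketch below is one possible way to derive that from the rows above; the metric (cracks per million candidates) and the cracks_per_million name are my own illustration and not part of the original benchmark.

```shell
# Hypothetical helper: for each "name | cracked | cracked % | size | keyspace"
# row on stdin, print the name and the cracks per million candidates.
# Header and separator rows are skipped because their second field is not a number.
cracks_per_million() {
  LC_ALL=C awk -F'|' 'NF >= 5 && $2 + 0 > 0 {
    name = $1
    gsub(/^ +| +$/, "", name)   # trim the column padding around the name
    printf "%s %.0f\n", name, $2 / $5 * 1000000
  }'
}
```

Feeding it the top5kmasks.lst row gives roughly 3,619 cracks per million candidates, while the ignis-1M row gives 288,167, which shows how differently the lists rank once size is taken into account.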
Application
Overall, the results show this strategy is an effective way of creating new word lists from cracked passwords. I would use it in any process where you want to develop word lists targeting specific password policy requirements for plaintext attacks. For transforming base words into candidates, rules are still the better choice.
Thanks for reading.