brute forcer

Kevin McGuire · Post by **Kevin McGuire** » Fri Jul 06, 2007 10:30 am

I ran some tests on my UNI processor machine, they should become faster on a SMP machine but at the moment the remote box I can test on is down.

1 Thread : 6 Byte Message
kmcguire@localhost ~ $ gcc md5.c -lpthread -O3 -o md5 && time ./md5 d6a6bc0db10694a2d90e3a69648f3a03 6 6 1
initializing..
digest_target is 0dbca6d6a29406b1693a0ed9033a8f64
cm:abcdefghijklmnopqrstuvwxyz
running
time: 66.130000
h/s:3.084658
(1) I found a collision!
message:hacker
g_thread_active_cnt:0

real 1m6.879s
user 1m6.055s
sys 0m0.089s

kmcguire@localhost ~ $ g++ brute-mt.cc -lpthread -O3 -o md5 && time ./md5 d6a6bc0db10694a2d90e3a69648f3a03 6 6 1
threadList[t].sequence[0]: 0
threadList[t].sequence[0]: 4
threadList[t].sequence[0]: 8
threadList[t].sequence[0]: 13
threadList[t].sequence[0]: 17
threadList[t].sequence[0]: 21
Collision Found!
hash[d6a6bc0db10694a2d90e3a69648f3a03] = 'hacker'
time: 75.23s
- avg. hash/s: 2827979.53 h/s
#done.

real 1m15.236s
user 1m11.300s
sys 0m0.160s

2 Thread : 6 Byte Message
kmcguire@localhost ~ $ gcc md5.c -lpthread -O3 -o md5 && time ./md5 d6a6bc0db10694a2d90e3a69648f3a03 6 6 2
initializing..
digest_target is 0dbca6d6a29406b1693a0ed9033a8f64
cm:abcdefghijklmnopqrstuvwxyz
running
cm:abcdefghijklmnopqrstuvwxyz
running
time: 32.280000
h/s:1.534403
(2) I found a collision!
message:hacker
g_thread_active_cnt:1

real 0m34.121s
user 0m32.174s
sys 0m0.114s

kmcguire@localhost ~ $ g++ brute-mt.cc -lpthread -O3 -o md5 && time ./md5 d6a6bc0db10694a2d90e3a69648f3a03 6 6 2
threadList[t].sequence[0]: 0
threadList[t].sequence[0]: 4
threadList[t].sequence[0]: 8
threadList[t].sequence[0]: 13
threadList[t].sequence[0]: 17
threadList[t].sequence[0]: 21
Collision Found!
hash[d6a6bc0db10694a2d90e3a69648f3a03] = 'hacker'
time: 74.65s
- avg. hash/s: 2846438.19 h/s
#done.

real 1m14.657s
user 1m11.332s
sys 0m0.139s

3 Thread : 6 Byte Message
kmcguire@localhost ~ $ gcc md5.c -lpthread -O3 -o md5 && time ./md5 d6a6bc0db10694a2d90e3a69648f3a03 6 6 3
initializing..
digest_target is 0dbca6d6a29406b1693a0ed9033a8f64
cm:abcdefghijklmnopqrstuvwxyz
running
cm:abcdefghijklmnopqrstuvwxyz
running
cm:abcdefghijklmnopqrstuvwxyz
running
time: 13.590000
h/s:1.021810
(3) I found a collision!
message:hacker
g_thread_active_cnt:2

real 0m13.788s
user 0m13.579s
sys 0m0.027s

kmcguire@localhost ~ $ g++ brute-mt.cc -lpthread -O3 -o md5 && time ./md5 d6a6bc0db10694a2d90e3a69648f3a03 6 6 3
threadList[t].sequence[0]: 0
threadList[t].sequence[0]: 4
threadList[t].sequence[0]: 8
threadList[t].sequence[0]: 13
threadList[t].sequence[0]: 17
threadList[t].sequence[0]: 21
Collision Found!
hash[d6a6bc0db10694a2d90e3a69648f3a03] = 'hacker'
time: 78.16s
- avg. hash/s: 2725642.17 h/s
#done.

real 1m18.163s
user 1m11.991s
sys 0m0.153s

I think this is attributed to:

A true splitting of the message space, instead of incremental steps.
A different core implementation. (slightly faster)

Okay. I got to have some fun...
When you have tried the rest you call the best.

os64dev · Post by **os64dev** » Fri Jul 06, 2007 10:47 am

sorry i can test now either. Intel Core Duo is at work and my AMD64 4400+ had a hardisk crash and i am still waiting for my replacement hardisk. I looked through your code and saw that you use clock for your timing. That does not work properly: use gettimeofday. Furthermore remove all printf's after the start timing. They are a majer influence on the accuracy. You have adapted the code to only passwords < 55 characters also?

Kevin McGuire · Post by **Kevin McGuire** » Fri Jul 06, 2007 10:49 am

I wrote a implementation of the md5_hash using MMX, but it was terribly slow. I did not further it much since I had started it in a attempt to speed up my multi-threaded implementation. When the remote box comes back online I can download it and post it. The implementation most likely has a bug in it since I did not check for it producing the correct hash but rather the overall timing of it to see if it was faster (even if flawed).

The overall machine instruction output seemed to be much larger in byte size, then the implementation we have been using and I can not say for sure right here if that was a cause for the slowdown.

Kevin McGuire · Post by **Kevin McGuire** » Fri Jul 06, 2007 10:57 am

os64dev wrote:sorry i can test now either. Intel Core Duo is at work and my AMD64 4400+ had a hardisk crash and i am still waiting for my replacement hardisk. I looked through your code and saw that you use clock for your timing. That does not work properly: use gettimeofday. Furthermore remove all printf's after the start timing. They are a majer influence on the accuracy. You have adapted the code to only passwords < 55 characters also?

That sounds like my other computer. I am only left with one that I use for testing. The (old) hard disk in it seems to have crashed and yesterday I could not get it to boot. No keyboard or mouse on it. Need to buy them for it so I can fix it.

Ah. I did not know that about clock(), even so though the actual output from the time command is the primary factor. I did decrease to less than fifty-five character messages (passwords?) because I felt that the number of possibilities would be astronomical and do not expect anyone to ever try to brute force something that large.

I understand too that you are talking about the program's own printing of the time. Yes, the is not completely accurate if due to the printf calls, and clock() not being accurate. I appreciate the information about clock(). I think I have run into that problem before when writing code to perform network packet throttling on a per client basis (I was having to track the bytes per second).

I can not remember if this was the function that produced bad results, or another (I am digging in my old code directories). Trying to post the good one (below).

Code: Select all

double getCurrentDoubleTime(void){
	double ctime;
	struct timespec ts;
	clock_gettime(CLOCK_REALTIME, &ts);
	ctime = (double)ts.tv_sec + ((double)ts.tv_nsec / (double)0xffffffff);
	return ctime;
}

It needs -lrt to compile if I remember correctly.

Also the problem with using these types of functions it that they acquire the actual real time, instead of the amount of time the process was executing. Such that if the system came under heavy load these would report higher than truthful times. Take for instance the time command in Linux.

kmcguire@localhost ~ $ time
real 0m0.000s (real time as in clock on wall time)
user 0m0.000s (time processor spent on this process in user space)
sys 0m0.000s (time processor spent on this process in kernel space)

GLneo · Post by **GLneo** » Fri Jul 06, 2007 12:18 pm

multi-threading is fun and all, but MDCrack ( the fastest md5 cracker ) is about 2x as fast as ours. I think it uses the 128bit SSE registers for its math, now a days it's hard to find a CPU without SSE, so an SSE module it can load if the CPU has SSE could greatly inprove the speed. just a thought

os64dev · Post by **os64dev** » Fri Jul 06, 2007 12:33 pm

are you now talking about the hashes per second or are you talking about total application time. The latter is very much subjected to the character set you are using. For instance if you would reorder the charset to the possible occurence of the character in the password. Then you get complete different results.

For the SSE2 i doubt it because it will help because you need parallelism in the algorithm and that is clearly not the case. every iteration needs the previous results. If the SSE2 instuctions would be a lot faster then it might help, but our code if prortable.

Personally i am rather pleased with the 7 Million hashes per second. That is a huge amount and even the single threaded version seems to beat cain (on certain system). I didn't even have any experience with MD5 or brute forcing until this topic started. However i know how to optimise code as i optimised the MP4 AAC decoder a few years back. The improvement was also a large factor. Rumours were that Quicktime took over the sofware from the compagny i was working for but that are just rumours.

Personally i am very iterested in the results of the multithreaded version on Candy's machine.

GLneo · Post by **GLneo** » Fri Jul 06, 2007 1:08 pm

yah, your probably right, just kind of frustrating not knowing how MDcrack is twice as fast

p.s. your really good at optimizing

Kevin McGuire · Post by **Kevin McGuire** » Fri Jul 06, 2007 1:25 pm

You are correct about the way you transverse the sequences being in a different order so as to produce shorter times for you and longer times for me, or vice-versa in some situations.

I replaced my clock with your exact implementation of gettimeofday. I still produce slightly higher hashes per second.

One problem I can see is, possibly.

Code: Select all

int genNextSequence(struct threadInfo * th, const char * charset, const int charset_length) 
{ 
    int x; 
    for(x = th->seqLength-1; th->sequence[x] == (charset_length-1); --x) { 

        if(x == 0) { 
            return 0; 
        } 
        th->hashKey[x] = charset[th->sequence[x] = 0]; 
    }
    th->hashKey[x] = charset[++th->sequence[x]]; 
    return 1; 
}

You are accessing a lot of data that is not on the stack, and not likely to be cached while producing extra instructions for offsets.

Here is my equivalent, which differs since you do yours in the next sequence function. The major difference being all my calculations stay accessing the stack instead of a random place in memory determined by malloc.

Code: Select all

for(x = 0; x < sequence_wlen; ++x)
{
	md5buffer[x] = l_charset_map[sequence[x]];
}
int32_t sequence_next(uint8_t *sequence, uint32_t charset_length, uint32_t length)
{
   uint32_t x;
   for(x = 0; sequence[x] == (charset_length-1); ++x)
   {
      if(x == (length-1))
      {
         return 0;
      }
      sequence[x] = 0;
   }
   ++sequence[x];
   return 1;
}

Ninjarider · Post by **Ninjarider** » Fri Jul 06, 2007 2:17 pm

intel manuals are in the mail.

Kevin McGuire · Post by **Kevin McGuire** » Fri Jul 06, 2007 2:58 pm

Here is the md5_hash written so that the new automatic tree vectorization in GCC4.x can understand what we are doing and vectorize it automatically. It indeed does produce SS2 code, but it is 2/3 the speed of the generic function. I have hunches why or maybe, but not definitive answer.

Maybe you guys can help me figure out why it is slower?
**gcc -ftree-vectorize -msse2**

Code: Select all

#define ROTATE_LEFT(x, n) (((x) << (n)) | ((x) >> (32-(n))))
#define F(x,y,z) (((x) & (y)) | ((~(x)) & (z)))
#define G(x,y,z) (((x) & (z)) | ((y) & (~(z))))
#define H(x,y,z) ((x) ^ (y) ^ (z))
#define I(x,y,z) ((y) ^ ((x) | (~(z))))
#define md5_s1_0 7
#define md5_s1_1 12
#define md5_s1_2 17
#define md5_s1_3 22
#define md5_s2_0 5
#define md5_s2_1 9 
#define md5_s2_2 14
#define md5_s2_3 20
#define md5_s3_0 4
#define md5_s3_1 11
#define md5_s3_2 16
#define md5_s3_3 23
#define md5_s4_0 6
#define md5_s4_1 10
#define md5_s4_2 15
#define md5_s4_3 21

#define ROTATE_LEFT(x, n) (((x) << (n)) | ((x) >> (32-(n))))
#define F(x,y,z) (((x) & (y)) | ((~(x)) & (z)))
#define G(x,y,z) (((x) & (z)) | ((y) & (~(z))))
#define H(x,y,z) ((x) ^ (y) ^ (z))
#define I(x,y,z) ((y) ^ ((x) | (~(z))))
#define MD5_OP4(a,b,c,d,e,f,g,h,i,j) \
	for(q = 0; q < 4; ++q) \
	{ \
		a[q] = c[q] + d(e[q], f[q], g[q]) + X[(h)*4+q] + i; \
		 \
	} \
	for(q = 0; q < 4; ++q) \
	{ \
		a[q] = b[q] + ROTATE_LEFT(a[q], j); \
	}

uint32_t __attribute__((aligned(128))) tsin[] = {0xd76aa478,0xe8c7b756,0x242070db,0xc1bdceee,0xf57c0faf,0x4787c62a,0xa8304613,0xfd469501,0x698098d8,0x8b44f7af,0xffff5bb1,0x895cd7be,
0x6b901122,0xfd987193,0xa679438e,0x49b40821,0xf61e2562,0xc040b340,0x265e5a51,0xe9b6c7aa,0xd62f105d,0x02441453,0xd8a1e681,0xe7d3fbc8,
0x21e1cde6,0xc33707d6,0xf4d50d87,0x455a14ed,0xa9e3e905,0xfcefa3f8,0x676f02d9,0x8d2a4c8a,0xfffa3942,0x8771f681,0x6d9d6122,0xfde5380c,
0xa4beea44,0x4bdecfa9,0xf6bb4b60,0xbebfbc70,0x289b7ec6,0xeaa127fa,0xd4ef3085,0x04881d05,0xd9d4d039,0xe6db99e5,0x1fa27cf8,0xc4ac5665,
0xf4292244,0x432aff97,0xab9423a7,0xfc93a039,0x655b59c3,0x8f0ccc92,0xffeff47d,0x85845dd1,0x6fa87e4f,0xfe2ce6e0,0xa3014314,0x4e0811a1,
0xf7537e82,0xbd3af235,0x2ad7d2bb,0xeb86d391}; 

uint32_t md5_hash4(uint32_t *X, uint32_t *digest)
{
	uint32_t 	__attribute__((aligned(128))) a[4],
			__attribute__((aligned(128))) b[4],
			__attribute__((aligned(128))) c[4],
			__attribute__((aligned(128))) d[4];
	uint32_t x, q;
	for(x = 0; x < 4; ++x)
	{
		a[x] = 0x67452301; 
		b[x] = 0xefcdab89; 
		c[x] = 0x98badcfe; 
		d[x] = 0x10325476;	
	}
	/// round one (rolled for automatic vectorization)
	for(x = 0; x < 4; ++x)
	{
		MD5_OP4(a, b, a, F, b, c, d, x+0, tsin[x+0], md5_s1_0);
		MD5_OP4(d, a, d, F, a, b, c, x+1, tsin[x+1], md5_s1_1);
		MD5_OP4(c, d, c, F, d, a, b, x+2, tsin[x+2], md5_s1_2);
		MD5_OP4(b, c, b, F, c, d, a, x+3, tsin[x+3], md5_s1_3);
	}
	/// round two (rolled for automatic vectorization)
	for(x = 0; x < 4; ++x)
	{
		MD5_OP4(a, b, a, G, b, c, d, x+0, tsin[x+0+16], md5_s2_0);
		MD5_OP4(d, a, d, G, a, b, c, x+1, tsin[x+1+16], md5_s2_1);
		MD5_OP4(c, d, c, G, d, a, b, x+2, tsin[x+2+16], md5_s2_2);
		MD5_OP4(b, c, b, G, c, d, a, x+3, tsin[x+3+16], md5_s2_3);
	}
	/// round three (rolled for automatic vectorization)
	for(x = 0; x < 4; ++x)
	{
		MD5_OP4(a, b, a, H, b, c, d, x+0, tsin[x+0+32], md5_s3_0);
		MD5_OP4(d, a, d, H, a, b, c, x+1, tsin[x+1+32], md5_s3_1);
		MD5_OP4(c, d, c, H, d, a, b, x+2, tsin[x+2+32], md5_s3_2);
		MD5_OP4(b, c, b, H, c, d, a, x+3, tsin[x+3+32], md5_s3_3);
	}
	/// round four (rolled for automatic vectorization)
	for(x = 0; x < 4; ++x)
	{
		MD5_OP4(a, b, a, I, b, c, d, x+0, tsin[x+0+48], md5_s4_0);
		MD5_OP4(d, a, d, I, a, b, c, x+1, tsin[x+1+48], md5_s4_1);
		MD5_OP4(c, d, c, I, d, a, b, x+2, tsin[x+2+48], md5_s4_2);
		MD5_OP4(b, c, b, I, c, d, a, x+3, tsin[x+3+48], md5_s4_3);
	}
	for(x = 0; x < 4; ++x)
	{
		a[x] += 0x67452301;
		b[x] += 0xefcdab89; 
		c[x] += 0x98badcfe; 
		d[x] += 0x10325476;
	}
	for(x = 0; x < 4; ++x)
	{
		if( (digest[0] == a[x]) && (digest[1] == b[x]) && (digest[2] == c[x]) && (digest[3] == d[x]) )
		{
			return x+1;
		}
	}
	return 0;
}

Ninjarider · Post by **Ninjarider** » Fri Jul 06, 2007 2:58 pm

something that may or may not increase spped. not sure how c compiles. i noticed a couple not(?) statements. we could try precomputing by multipling the number by -1. could be done in paires before there needed in the functions. doing this should invoke the fpu to handle the not statements allow the cpu to take care of other stuff.

os64dev · Post by **os64dev** » Fri Jul 06, 2007 3:25 pm

@kevin.

it seems that your code adds tables again and thus additional memory reference instead of inline values. if you remember that was the first improvement from 600 K to 1.2 M hashes per second. This is 50 % and with the current speed of the algorithm 50 % is a lot. Note that i'm not sure about this as i didn't test it. problably the percentage is even worse and SSE2 compensates somewhat. again not sure

Kevin McGuire · Post by **Kevin McGuire** » Fri Jul 06, 2007 3:29 pm

I'll unroll it, but the function can become quite larger than the original function. The last time I unrolled it (only using) MMX it had horrible performance.

I think the code causes cache problems because of how large it is when generated. When doing smaller code optimization the speed improves slightly.

os64dev · Post by **os64dev** » Fri Jul 06, 2007 3:36 pm

same with my mt solution i also wanted the hash in the threadinfo structure but that killed performance, due to code and data locality.
so loop unrolling will be bad. maybe a solution as to storing the hash in SSE 2 registers or something like that. the more we store in registers the faster it should work. by theory.

Kevin McGuire · Post by **Kevin McGuire** » Fri Jul 06, 2007 3:50 pm

I unrolled and even substituted ever access of the mb5_buffer(also known as X as a function argument to md5_hash4) with one memory access to the same place every time (A broken md5 hash algorithm). But, never the less it should help to determine if the code size is a problem.

With my testing environment and shell of code around the function it jumped only from ~500,000 per second to ~670,000 per second. .67 * 4 = 2.68 million per second. No where near a four fold increase, or looking in other words it looks like it could possibly amount of a two fold increase after rigorous optimizing you think?

I am also having trouble optimizing the barrel rolls with SSE instructions. It has to drop down to using the generic register. This could be a problem..

/// dropping out to do barrel rolls with generic registers
80485e7: 0f 29 45 e8 movaps %xmm0,0xffffffe8(%ebp)
80485eb: 8b 45 e8 mov 0xffffffe8(%ebp),%eax
80485ee: 8b 55 f4 mov 0xfffffff4(%ebp),%edx
80485f1: c1 c8 19 ror $0x19,%eax
80485f4: 03 45 d8 add 0xffffffd8(%ebp),%eax
80485f7: c1 ca 19 ror $0x19,%edx
80485fa: 03 55 e4 add 0xffffffe4(%ebp),%edx
80485fd: 89 45 e8 mov %eax,0xffffffe8(%ebp)
8048600: 8b 45 ec mov 0xffffffec(%ebp),%eax
8048603: 89 55 f4 mov %edx,0xfffffff4(%ebp)
8048606: c1 c8 19 ror $0x19,%eax
8048609: 03 45 dc add 0xffffffdc(%ebp),%eax
804860c: 89 45 ec mov %eax,0xffffffec(%ebp)
804860f: 8b 45 f0 mov 0xfffffff0(%ebp),%eax
8048612: c1 c8 19 ror $0x19,%eax
8048615: 03 45 e0 add 0xffffffe0(%ebp),%eax
8048618: 89 45 f0 mov %eax,0xfffffff0(%ebp)

OSDev.org

brute forcer

MD5 Computation Using MMX