Re: Assembler syntax

Posted: Thu Sep 11, 2014 12:03 pm
by SoLDMG
Jezze wrote:I totally agree with you there SoLDMG. C is so close to the perfect language for me that there isn't much I would like to be different in the language itself, besides adding more syntactic restrictions and something better for defining data structures than just structs and/or bitfields together with either enums or defines for register definitions. Where I think the problem is today is in the tools, which are just too big and too bloated with options.
Exactly my point, plus there is an actual user base for it. There are thousands (maybe millions) of C programs that people actually use. Every production-grade operating system kernel (Linux, BSD, Windows, OS X, Minix, GNU/Mach?) is basically written in C. No one is waiting for another C-like language, or another language at all. We already have hundreds of languages. What we don't have are competent tools for the right job. Sure, the GNU Compiler Collection is moldable to an extent and the LLVM project does an amazing job, but both are getting kinda bloated as well. That's also why I support the idea of open standards that anyone can use: so that tools can be continuously written and replaced every now and then, and we keep the code fresh, small and thus usable.

Re: Assembler syntax

Posted: Fri Sep 12, 2014 9:10 am
by SoLDMG
Bencz wrote: Hi!

Why not generate a ".obj" file, using the OMF object format?
http://en.wikipedia.org/wiki/Relocatabl ... ule_Format
Why would I support a pretty much dead format?
Bencz wrote: In the code generator of your C compiler, you can make a struct holding both the machine code and the text asm code. In that struct you can put both syntaxes, AT&T and Intel, and let the user choose whether to generate machine code or asm text.

Code: Select all

enum 
{ 
    push_eax=0,....
}

instructions instru[] =
{
    { "50", "push eax", "pushl %eax"}, ....
}

That is what I am going to do, after reading this. This is actually a pretty good idea.

Re: Assembler syntax

Posted: Fri Sep 12, 2014 5:15 pm
by Bencz
That is what I did in my C compiler ....
But my compiler generates a Win32 EXE.

Re: Assembler syntax

Posted: Sat Sep 13, 2014 5:36 am
by SoLDMG
Bencz wrote:That is what I did in my C compiler ....
But my compiler generates a Win32 EXE.
Oh... Well, my toolchain will probably support a lot of formats. OMF just isn't used anymore. PE and ELF are 'the name of the game' so to speak.

You probably already knew that though.

Re: Assembler syntax

Posted: Sat Sep 13, 2014 6:38 am
by alexfru
Bencz wrote: Why not generate a ".obj" file, using the OMF object format?
In practice, OMF comes with a number of incompatible and non-universally supported extensions and misinterpretations by various implementors, which is why there are very few OMF tools that are actually compatible at the object file level. Besides that, it's quite complex on its own.

I've decided to use ELF and have been happy with the decision. There's even a tiny 16-bit extension (supported by GNU as and NASM) that allows one to compile 16-bit code into ELF (just 2 more relocation types to reflect 16-bit relocations) and I'm taking advantage of that in my compiler. The compiler supports 16-bit and 32-bit modes and generates assembly code for NASM and then links the resultant ELFs. One format for 16 bits and 32 bits, one assembler for DOS, Windows and Linux. It's a simple and sane format. At least, for x86 and static linking. About the only thing I dislike about it is the size of the symbol table. Every symbol takes 16 bytes plus whatever is needed for its name. That's a bit too much, IMHO.

Re: Assembler syntax

Posted: Sat Sep 13, 2014 6:53 am
by Brendan
Hi,
Bencz wrote:In the code generator of your C compiler, you can make a struct holding both the machine code and the text asm code. In that struct you can put both syntaxes, AT&T and Intel, and let the user choose whether to generate machine code or asm text.

Code: Select all

enum 
{ 
    push_eax=0,....
}

instructions instru[] =
{
    { "50", "push eax", "pushl %eax"}, ....
}

Have you got any idea how large that table is going to be?

For just one addressing mode (e.g. with SIB, like "mov rax,[rbx+rcx*4+offset]") there's an 8-bit REX prefix, an 8-bit ModRM then an 8-bit SIB; which means there's 2**(8+8+8) = 16777216 unique encodings for that addressing mode. For each instruction there's probably 3 opcodes/addressing modes on average; and there's probably over 200 instructions. As a rough estimate, your lookup table is probably going to need about 1 billion entries. If you assume 32 bytes per entry you'd be looking at a total size of about 32 GiB.

You will need to generate at least part of the instruction using code - e.g. maybe one table containing the instruction's mnemonic (without operands) and a list of "opcode and addressing mode" pairs; then use the addressing mode (from the first table) to figure out how to generate the operands (possibly using several smaller lookup tables).

Of course you could also do the smart thing; and generate machine code instead of assembly so that you don't need to bother generating text, then parsing the text and assembling. If anyone wants plain text, there's plenty of decent disassemblers floating around for both Intel syntax and AT&T syntax. ;)


Cheers,

Brendan

Re: Assembler syntax

Posted: Sat Sep 13, 2014 10:57 am
by alexfru
Brendan wrote:
Bencz wrote: In the code generator of your C compiler, you can make a struct holding both the machine code and the text asm code. In that struct you can put both syntaxes, AT&T and Intel, and let the user choose whether to generate machine code or asm text.

Code: Select all

enum 
{ 
    push_eax=0,....
}

instructions instru[] =
{
    { "50", "push eax", "pushl %eax"}, ....
}
Have you got any idea how large that table is going to be?
:) A small and simple x86 assembler should take only a few KLOC of C code, roughly 3-5 KLOC.

Re: Assembler syntax

Posted: Sat Sep 13, 2014 1:12 pm
by Bencz
Here is my instructions table... it works very well for me...

Code: Select all

instructions op[] = {
    { "8D85%08X", "lea eax,[ebp+]"      },
    { "50",     "push eax"              },
    { "51",     "push ecx"              },
    { "55",     "push ebp"              },
    { "58",     "pop eax"               },
    { "59",     "pop ecx"               },
    { "03C1",   "add eax,ecx"           },
    { "05%08X", "add eax"               },
    { "0101",   "add [ecx],eax"         },
    { "660101", "add [ecx],ax"          },
    { "0001",   "add [ecx],al"          },
    { "83C4%02X", "add1 esp"            }, // byte operand
    { "81C4%08X", "add4 esp"            }, // int  operand
    { "8300%02X", "add1 dwordptr[eax]"  },
    { "8100%08X", "add4 dwordptr[eax]"  },
    { "8301%02X", "add1 dwordptr[ecx]"  },
    { "8101%08X", "add4 dwordptr[ecx]"  },
    { "8000%02X", "add byteptr[eax]"    },
    { "8001%02X", "add byteptr[ecx]"    },
    { "48",     "dec eax"               },
    { "2BC1",   "sub eax,ecx"           },
    { "2901",   "sub [ecx],eax"         },
    { "662901", "sub [ecx],ax"          },
    { "2801",   "sub [ecx],al"          },
    { "83EC%02X", "sub1 esp"            },
    { "81EC%08X", "sub4 esp"            }, 
    { "8328%02X", "sub1 dwordptr[eax]"  },
    { "8128%08X", "sub4 dwordptr[eax]"  },
    { "8329%02X", "sub1 dwordptr[ecx]"  },
    { "8129%08X", "sub4 dwordptr[ecx]"  },
    { "8028%02X", "sub byteptr[eax]"    },
    { "8029%02X", "sub byteptr[ecx]"    },
    { "0FAFC1",   "imul eax,ecx"        }, 
    { "69C0%08X", "imul eax,eax"        },
    { "99",     "cdq"                   }, // Convert Double to Quad.
    { "F7F9",   "idiv ecx"              }, 
    { "3BC8",   "cmp ecx,eax"           },
    { "83F8%02X", "cmp1 eax"            },
    { "81F8%08X", "cmp4 eax"            },
    { "80FC%02X", "cmp ah"              },
    { "F6C4%02X", "test ah"             },
    { "23C1",   "and eax,ecx"           },
    { "80E4%02X", "and ah"              },
    { "09C0",   "or eax,eax"            },
    { "0BC1",   "or eax,ecx"            },
    { "0901",   "or [ecx],eax"          },
    { "660901", "or [ecx],ax"           },
    { "0801",   "or [ecx],al"           },
    { "31C0",   "xor eax,eax"           },
    { "33C1",   "xor eax,ecx"           },
    { "80F4%02X", "xor ah"              },
    { "D3E0",   "shl eax,cl"            },
    { "D3E8",   "shr eax,cl"            },
    { "F7D8",   "neg eax"               }, 
    { "89D0",   "mov eax,edx"           },
    { "8BC8",   "mov ecx,eax"           },
    { "B8%08X", "mov eax"               },
    { "B8V%06X_", "mov eax_v"           },
    { "B8X%06X_", "mov eax_x"           },
    { "B8fn_%04X_", "mov eax_fn"        }, 
    { "B8FN_%04X_", "mov eax_FN"        }, 
    { "B9%08X", "mov ecx"               },
    { "B9V%06X_", "mov ecx_v"           },
    { "BA%08X", "mov edx"               },
    { "C700%08X", "mov dwordptr[eax]"   },
    { "C700V%06X_", "mov dwordptr[eax]_v"},
    { "66C700%04X", "mov wordptr[eax]"  },
    { "C600%02X",   "mov byteptr[eax]"  },
    { "8B00",   "mov eax,[eax]"         },
    { "8B01",   "mov eax,[ecx]"         },
    { "89E5",   "mov ebp,esp"           },
    { "8901",   "mov [ecx],eax"         },
    { "668901", "mov [ecx],ax"          },
    { "8801",   "mov [ecx],al"          },
    { "0FBF00", "movsx eax,wordptr[eax]"}, 
    { "0FBE00", "movsx eax,byteptr[eax]"}, 
    { "91",     "xchg eax,ecx"          },
    { "74%02X", "jz "                   },
    { "75%02X", "jnz "                  },
    { "E9ln_%04X_", "jmp "              },
    { "0F85ln_%04X_", "jne "            },
    { "0F82%08X", "jb "                 },
    { "7C%02X", "jl "                   },
    { "7D%02X", "jge "                  },
    { "7E%02X", "jle "                  },
    { "7F%02X", "jg "                   },
    { "E8fn_%04X_", "call "             }, 
    { "FF10",   "call dwordptr[eax]"    },
    { "FF15X%06X_", "call dwordptr[]"   },
    { "C9",     "leave"                 },
    { "C3",     "ret"                   },
    { "0F94C0", "sete al"               },
    { "0F95C0", "setne al"              },
    { "D9E0",   "fchs"                  },
    { "D9C9",   "fxch st(1)"            }, 
    { "DD00",   "fld qwordptr[eax]"     },
    { "DD01",   "fld qwordptr[ecx]"     },
    { "DD5C2400", "fst qwordptr[esp]"   },
    { "DFE0",   "fstsw"                 },
    { "DD18",   "fstp qwordptr[eax]"    },
    { "DD19",   "fstp qwordptr[ecx]"    },
    { "DEC1",   "faddp st(1),st"        }, // +=
    { "DEE9",   "fsubrp st(1),st"       }, // -=
    { "DEC9",   "fmulp st(1),st"        },
    { "DEF9",   "fdivrp st(1),st"       },
    { "DAE9",   "fucompp"               },
    { "DB1C24", "fistp dwordptr[esp]"   },
    { "DC25V%06X_", "fsub qwordptr[]_v" },
 };

Re: Assembler syntax

Posted: Sat Sep 13, 2014 5:00 pm
by Brendan
Hi,
Bencz wrote:Here is my instructions table... it works very well for me...
It's not even slightly close to "works very well".

My guess is that your compiler does no optimisation at all and just uses the CPU as a stack machine (constantly pushing and popping while half of the CPU's registers aren't used); and the generated code probably runs about 100 times slower than it should.


Cheers,

Brendan

Re: Assembler syntax

Posted: Sun Sep 14, 2014 7:09 am
by Bencz
I'm not worried about it, it's just for study.

Re: Assembler syntax

Posted: Sun Sep 14, 2014 8:13 am
by Bencz
When I developed this compiler, my main intention was to study the EXE format, so I was not at all worried about optimizing the code....

Re: Assembler syntax

Posted: Sun Sep 14, 2014 2:06 pm
by SoLDMG
For anyone still wondering which syntax I'm leaning towards more and more: it's the Intel syntax with AT&T-like directives, and an exclamation mark starts a comment. The latter can be changed really easily though, if people don't like it. So a sample boot sector would look like this:

Code: Select all

! A boot sector.
.bits16
.org 0x7c00

! Include the BPB.
.include "bpb.inc"

start:
	jmp short boot
	nop

bpb:
	.insert _bpb

print:
	mov ah, 0Eh
	repeat:
		lodsb
		or al, 0
		jz done
		int 0x10
		jmp repeat
	done:
		ret

boot:
	! Print a message.
	mov si, msg
	call print
	! Halt the system.
	cli
	hlt

msg:
	.ascii	"Hallo wereld!"
	.hex	0x0A
	.hex	0x0D
	.dec	0

! Make sure the binary size is 510 bytes + boot signature, and make the filler 0.
.size 510 0
! Boot signature.
.hex 0x55
.hex 0xAA
And then of course "bpb.inc" would contain the BIOS parameter block, and it would be inserted.

I do realize I'm REALLY (almost discriminatingly to other probably better written assemblers) re-inventing the wheel here.

Re: Assembler syntax

Posted: Sun Sep 14, 2014 3:42 pm
by b.zaar
If you are not using the characters ; or # or // for another purpose why are you creating a new comment character?

Re: Assembler syntax

Posted: Sun Sep 14, 2014 10:09 pm
by Wajideu
I didn't read every post in this thread, so I'm not sure what you've decided on, but I'd highly suggest not creating your own assembly language. It just defeats the purpose. Adding in things like macros, structures, enumerations, local labels, etc. is a good idea though.
Brendan wrote:I remember a presentation (which was actually C++ syntax, and may have been about a code sanitiser that Google built out of parts of the LLVM project) where they investigated where programmers' time is spent and found that most programmers spend about 20% of their time just diddling with white-space.
I know this is an old post, but I figured I'd reply that this often depends on the person. K&R users are probably far more likely to spend time messing around with whitespace than Allman style users because Allman style is much easier to comment. People who use spaces probably spend more time on whitespace than people who use tabs as well. I tend to use GNU style these days with pre-ANSI C function declarations.

It took some time to get used to at first, but it's very easy to comment (especially function arguments) and the indentation better reflects how the code is actually parsed.

On a side note, one idea I came up with that I think would be a good extension to the C language is a behavior declaration. It'd allow better namespacing and class-like functionality.

Re: Assembler syntax

Posted: Mon Sep 15, 2014 8:31 am
by SoLDMG
b.zaar wrote:If you are not using the characters ; or # or // for another purpose why are you creating a new comment character?
It looks nice, I guess. And what would you want it to be? Just getting as much feedback as possible.