Does the program have to start from the main function?

Posted Jun 6, 2020 · 8 min read

Does a program have to start from the main function? Answering that question requires some background on static linking.

Two questions about static linking come first:

Q: Each object file has several sections. When object files are linked into an executable, how are the sections of the input files merged into the output file?

A: Sections of the same kind are merged: all .text sections go into the .text section of the output file, and all .data sections go into its .data section.

Q: How does the linker allocate space and addresses for them in the output file?

A: Linking proceeds in two steps:

  1. Space and address allocation: the linker scans all input object files to obtain the length, attributes, and position of each of their sections, collects every symbol definition and symbol reference from their symbol tables into a global symbol table, then merges the sections, computes the combined length and position of each section in the output file, and establishes the mapping.
  2. Symbol resolution and relocation: using the information collected in the first step, the linker reads the section data and relocation information from the input files, resolves symbols, and adjusts the addresses in the code. The instructions and data in each section that need relocation are "patched" so that they all point to the correct locations.

Tip: External symbols are symbols that an object file references but that are defined in other object files. Before linking, the address of an external symbol is a placeholder such as 0; after linking, the executable shows a real address for each of them. Because linking places similar sections together, the linker first determines the address of the merged section and then the offset of the symbol within it, which fixes the symbol's address in the whole executable.
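The address calculation described above can be sketched in C. This is a toy model under stated assumptions, not how ld is implemented: the struct, the function names, and the sample sizes are hypothetical, and alignment padding between input sections is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* One input section taken from one object file (hypothetical toy model). */
struct input_section {
    const char *file;     /* name of the object file, e.g. "a.o" */
    uint64_t    size;     /* length of the section in bytes */
    uint64_t    out_off;  /* filled in by layout(): offset inside the merged section */
};

/* Step 1: place same-named input sections one after another.
   Returns the total size of the merged output section. */
uint64_t layout(struct input_section *secs, size_t n)
{
    uint64_t off = 0;
    for (size_t i = 0; i < n; i++) {
        secs[i].out_off = off;
        off += secs[i].size;
    }
    return off;
}

/* Final symbol address = address of the output section
                        + offset of its input section inside the merged section
                        + offset of the symbol inside its input section. */
uint64_t symbol_address(uint64_t out_sec_addr,
                        const struct input_section *sec,
                        uint64_t sym_off)
{
    return out_sec_addr + sec->out_off + sym_off;
}
```

For instance, if the .text sections of a.o and b.o are 0x34 and 0x4a bytes long, a symbol at offset 0x10 in b.o's .text ends up 0x44 bytes past the start of the merged .text section.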

Symbols that need relocation are recorded in a relocation table, also called a relocation section, such as .rel.text or .rel.data: if the .text section contains locations to relocate there is a .rel.text section, and if the .data section does there is a .rel.data section. (On x86-64 these sections are named .rela.text and .rela.data.) You can use objdump to view the relocation table of an object file.

Source code:

#include <stdio.h>

int main() {
    printf("program meow\n");
    return 0;
}

gcc -c test.c

objdump -r test.o

test.o:     file format elf64-x86-64

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE              VALUE
0000000000000007 R_X86_64_PC32     .rodata-0x0000000000000004
000000000000000c R_X86_64_PLT32    puts-0x0000000000000004

RELOCATION RECORDS FOR [.eh_frame]:
OFFSET           TYPE              VALUE
0000000000000020 R_X86_64_PC32     .text
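To make the "patching" concrete, here is a minimal sketch of how a 32-bit PC-relative record such as R_X86_64_PC32 could be applied, using the formula S + A - P (symbol address plus addend minus the address of the patched field). The function name and the addresses in the usage note are hypothetical; a real linker does much more bookkeeping.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Apply one PC-relative relocation to a section held in memory.
   section      : bytes of the output section
   section_addr : final address of that section
   r_offset     : OFFSET column of the relocation record
   addend       : the number after the symbol in the VALUE column (A)
   sym_addr     : final address of the referenced symbol (S) */
void apply_pc32(uint8_t *section, uint64_t section_addr,
                uint64_t r_offset, int64_t addend, uint64_t sym_addr)
{
    uint64_t place = section_addr + r_offset;               /* P */
    int64_t value = (int64_t)(sym_addr - place) + addend;   /* S + A - P */

    /* The field is only 32 bits wide, so the result has to fit. */
    assert(value >= INT32_MIN && value <= INT32_MAX);

    int32_t field = (int32_t)value;
    /* Stored in host byte order; x86-64 is little-endian. */
    memcpy(section + r_offset, &field, sizeof field);
}
```

With a section at 0x401000, the record at offset 0xc with addend -4, and a symbol assumed to land at 0x401030, the patched field becomes 0x20. The addend is -4 because the CPU measures the displacement from the end of the 4-byte field, where the next instruction begins, rather than from its start.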

You can also view the symbols that need to be relocated using nm:

nm -u test.o
                 U _GLOBAL_OFFSET_TABLE_
                 U puts

Symbols of type U (shown as UND by readelf) are undefined in this object file, and each of them has relocation entries in the object file. After the linker has scanned all input object files, every such undefined symbol must be found in the global symbol table; otherwise the linker reports an undefined symbol error.
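The check described above can be modeled with a small global symbol table. The sketch below is only an illustration: the types and function names are hypothetical, and it ignores details such as duplicate definitions and weak symbols.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SYMS 64

struct sym {
    const char *name;
    int defined;   /* 1 once some object file defines it, 0 if only referenced */
};

struct symtab {
    struct sym syms[MAX_SYMS];
    size_t count;
};

/* Record a definition (defined = 1) or a reference (defined = 0).
   Returns 0 on success, -1 if the table is full. */
int symtab_add(struct symtab *t, const char *name, int defined)
{
    for (size_t i = 0; i < t->count; i++) {
        if (strcmp(t->syms[i].name, name) == 0) {
            if (defined)
                t->syms[i].defined = 1;
            return 0;
        }
    }
    if (t->count == MAX_SYMS)
        return -1;
    t->syms[t->count].name = name;
    t->syms[t->count].defined = defined;
    t->count++;
    return 0;
}

/* After all inputs are scanned: return the first symbol that is still
   undefined, or NULL if every reference can be resolved. */
const char *symtab_first_undefined(const struct symtab *t)
{
    for (size_t i = 0; i < t->count; i++)
        if (!t->syms[i].defined)
            return t->syms[i].name;
    return NULL;
}
```

Scanning test.o alone adds main as defined and puts as referenced, so puts is reported as undefined; once an input that defines puts is scanned as well, nothing is left unresolved.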

Note: the code clearly calls printf, so why does it reference the symbol puts? By default the compiler replaces a printf call whose only argument is a string with no format specifiers and a trailing newline with a call to puts, which saves the cost of parsing the format string. The -fno-builtin option turns off this built-in function optimization:

~/test$ gcc -c -fno-builtin testlink.cc -o test.o
~/test$ nm test.o
                 U _GLOBAL_OFFSET_TABLE_
0000000000000000 T main
                 U printf
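A small file makes the difference visible. The function names here are made up for illustration, and the exact behavior depends on the compiler version and options, so treat the comments as what GCC typically does rather than a guarantee.

```c
#include <stdio.h>

/* Constant string, no format specifiers, trailing newline, result unused:
   GCC typically emits a call to puts for this printf. */
void greet(void)
{
    printf("program meow\n");
}

/* The format string contains a conversion, so the call stays printf. */
void show(int n)
{
    printf("value: %d\n", n);
}
```

Compiling this with gcc -c and running nm -u on the object file should list both puts and printf; adding -fno-builtin should leave only printf.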

Tip: Today's programs and libraries are usually large, and a single object file may contain hundreds or thousands of functions and variables. When any one function or variable in an object file is needed, the whole object file has to be linked in, so unused functions are linked in as well. This can make the output file very large and wastes space.

There is an option called function-level linking, which places each function or variable in its own section. When the linker needs a function it merges that section into the output file, and it discards the sections of functions that are never used. This reduces wasted space but slows down compilation and linking. The GCC options are:

-ffunction-sections
-fdata-sections
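As a rough illustration, consider the file below. The file name, the function names, and the build commands in the comment are assumptions for this sketch. Note that the two compiler options only split the code into sections; discarding unused sections additionally requires the linker option --gc-sections.

```c
/* sections_demo.c (hypothetical file name)
 *
 * Build sketch, assuming a separate main.o whose main calls only used_add:
 *   gcc -ffunction-sections -fdata-sections -c sections_demo.c
 *   gcc -Wl,--gc-sections main.o sections_demo.o -o demo
 *
 * With -ffunction-sections each function is placed in its own section
 * (.text.used_add, .text.never_called) instead of one shared .text, and
 * -fdata-sections does the same for variables.
 */

int used_add(int a, int b)
{
    return a + b;
}

/* Nothing references this function, so its section can be dropped. */
int never_called(int x)
{
    return x * 2;
}

/* With -fdata-sections this array gets its own section, .data.unused_table. */
int unused_table[256] = {1};
```

Running objdump -h on the object file should show the per-function sections, and nm on the final executable can be used to check whether never_called and unused_table survived the link.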

Many people think a program starts and ends with the main function, but it does not. Before main is called, the process execution environment must be initialized so that the program can run properly, for example heap allocation and the thread subsystem. Constructors of C++ global objects also run during this period, and destructors of global objects run after main returns.
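The same effect can be observed from plain C with a GCC/Clang extension, which is used here as a stand-in for C++ global constructors and destructors. The function names are hypothetical.

```c
#include <stdio.h>

static int startup_done = 0;

/* GCC/Clang extension: runs during startup, before main is called. */
__attribute__((constructor))
static void before_main(void)
{
    startup_done = 1;
    puts("constructor: runs before main");
}

/* GCC/Clang extension: runs after main returns or exit() is called. */
__attribute__((destructor))
static void after_main(void)
{
    puts("destructor: runs after main");
}

/* Returns 1 once the constructor has run. */
int startup_ran(void)
{
    return startup_done;
}
```

If main calls startup_ran() as its very first statement, the result is already 1, which shows that code ran before main.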

The entry point of an ordinary Linux program is the _start function, not main. Two sections hold code that runs around main:

  • .init section: the process initialization code. When a program starts running, the code in the .init section runs before main is called.
  • .fini section: the process termination code. When main returns normally, glibc arranges for the code in this section to run.

How to specify program entry

The -e option of ld specifies the program entry point at link time. Even a short printf call depends on several libraries, so it would be inconvenient to write a linker script that links the object file against all of them. The program below therefore uses inline assembly to print a string without relying on any library. Readers need not worry if the assembly is unclear; only the linking concepts introduced below matter.

The code is as follows:

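const char* str = "hello";

/* Both functions use the legacy int $0x80 system call interface
   with the i386 system call numbers. */

void print() {
    asm("movl $5,%%edx \n\t"    /* edx: number of bytes in "hello" */
        "movl str,%%ecx \n\t"   /* ecx: buffer address, loaded from the variable str */
        "movl $1,%%ebx \n\t"    /* ebx: file descriptor 1 (stdout) */
        "movl $4,%%eax \n\t"    /* eax: system call 4 (write) */
        "int $0x80 \n\t"
        :
        :"r"(str):"edx", "ecx", "ebx", "eax");
}

void exit() {
    asm("movl $42,%ebx \n\t"    /* ebx: exit status 42 */
        "movl $1,%eax \n\t"     /* eax: system call 1 (exit) */
        "int $0x80 \n\t");
}

/* Entry point used instead of main. */
void nomain() {
    print();
    exit();
}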

Use the following command to generate the object file:

gcc -c -fno-builtin test.cc

Look at the symbols in the resulting test.o:

~/test$ nm -a test.o
0000000000000000 b .bss
0000000000000000 n .comment
0000000000000000 d .data
0000000000000000 d .data.rel.local
0000000000000000 r .eh_frame
0000000000000000 n .note.GNU-stack
0000000000000000 r .rodata
0000000000000000 t .text
0000000000000026 T _Z4exitv
0000000000000000 T _Z5printv
0000000000000039 T _Z6nomainv
0000000000000000 D str
0000000000000000 a test.cc

Because my source file ends in .cc, it is compiled as C++, so the function names are mangled as shown above. If the file is renamed to test.c, the symbols are as follows:

~/test$ gcc -c -fno-builtin test.c -o test.o
~/test$ nm -a test.o
0000000000000000 b .bss
0000000000000000 n .comment
0000000000000000 d .data
0000000000000000 d .data.rel.local
0000000000000000 r .eh_frame
0000000000000000 n .note.GNU-stack
0000000000000000 r .rodata
0000000000000000 t .text
0000000000000026 T exit
0000000000000039 T nomain
0000000000000000 T print
0000000000000000 D str
0000000000000000 a test.c

Then use -e to specify the entry function symbol:

~/test$ ld -static -e nomain -o test test.o
~/test$ ./test
hello
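To confirm that the entry point really changed, compare the "Entry point address" printed by readelf -h test with the address of nomain printed by nm test. The same field can also be read programmatically; the following is a minimal sketch for 64-bit ELF files on Linux, where the function name read_elf_entry is an assumption made for this example.

```c
#include <elf.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read the entry point address (e_entry) from a 64-bit ELF file.
   Returns 0 on success, -1 if the file cannot be read or is not ELF64. */
int read_elf_entry(const char *path, uint64_t *entry)
{
    Elf64_Ehdr hdr;
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    size_t got = fread(&hdr, 1, sizeof hdr, f);
    fclose(f);
    if (got != sizeof hdr)
        return -1;

    /* Check the magic bytes and the 64-bit class before trusting the header. */
    if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) != 0 ||
        hdr.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;

    *entry = hdr.e_entry;
    return 0;
}
```

Calling read_elf_entry("test", &entry) on the executable built above should yield the address that nm reports for nomain.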

The -T option of ld specifies a linker script. You can view the default linker script with ld -verbose. The full output is too long to reproduce, so here is an abridged excerpt:

$ ld -verbose
GNU ld (GNU Binutils for Ubuntu) 2.30
  Supported emulations:
   elf_x86_64
   elf32_x86_64
   elf_i386
   elf_iamcu
   i386linux
   elf_l1om
   elf_k1om
   i386pep
   i386pe
using internal linker script:
==================================================
/* Script for -z combreloc: combine and sort reloc sections */
/* Copyright (C) 2014-2018 Free Software Foundation, Inc.
   Copying and distribution of this script, with or without modification,
   are permitted in any medium without royalty provided the copyright
   notice and this notice are preserved. */
OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64",
              "elf64-x86-64")
OUTPUT_ARCH(i386:x86-64)
ENTRY(_start)
SEARCH_DIR("=/usr/local/lib/x86_64-linux-gnu"); SEARCH_DIR("=/lib/x86_64-linux-gnu"); SEARCH_DIR("=/usr/lib/x86_64-linux-gnu") ; SEARCH_DIR("=/usr/lib/x86_64-linux-gnu64"); SEARCH_DIR("=/usr/local/lib64"); SEARCH_DIR("=/lib64"); SEARCH_DIR("=/usr/lib64") ; SEARCH_DIR("=/usr/local/lib"); SEARCH_DIR("=/lib"); SEARCH_DIR("=/usr/lib"); SEARCH_DIR("=/usr/x86_64-linux-gnu/lib64") ; SEARCH_DIR("=/usr/x86_64-linux-gnu/lib");
SECTIONS
{
  /* Read-only sections, merged into text segment: */
  PROVIDE (__executable_start = SEGMENT_START("text-segment", 0x400000)); . = SEGMENT_START("text-segment", 0x400000) + SIZEOF_HEADERS;

  .init           :
  {
    KEEP (*(SORT_NONE(.init)))
  }
  .plt            : { *(.plt) *(.iplt) }
  .plt.got        : { *(.plt.got) }
  .plt.sec        : { *(.plt.sec) }
  .text           :
  {
    *(.text.unlikely .text.*_unlikely .text.unlikely.*)
    *(.text.exit .text.exit.*)
    *(.text.startup .text.startup.*)
    *(.text.hot .text.hot.*)
    *(.text .stub .text.* .gnu.linkonce.t.*)
    /* .gnu.warning sections are handled specially by elf32.em. */
    *(.gnu.warning)
  }
  .fini           :
  {
    KEEP (*(SORT_NONE(.fini)))
  }
  .rodata         : { *(.rodata .rodata.* .gnu.linkonce.r.*) }
  /DISCARD/ : { *(.note.GNU-stack) *(.gnu_debuglink) *(.gnu.lto_*) }
}

Here is a simple custom linker script, test.lds:

ENTRY(nomain)

SECTIONS
{
    . = 0x8048000 + SIZEOF_HEADERS;
    tinytext : { *(.text) *(.data) *(.rodata) }
    /DISCARD/ : { *(.comment) }
}

Then use -T to specify the link script:

~/test$ ld -static -T test.lds -e nomain -o test test.o
~/test$ ./test
hello

The tinytext line above merges the contents of the .text, .data, and .rodata sections into a single tinytext section. Use readelf to view the section information:

~/test$ readelf -S test
There are 6 section headers, starting at offset 0x482a0:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .eh_frame         PROGBITS         00000000080480b0  000480b0
       0000000000000078  0000000000000000   A       0     0     8
  [ 2] tinytext          PROGBITS         0000000008048128  00048128
       0000000000000066  0000000000000000 WAX       0     0     8
  [ 3] .shstrtab         STRTAB           0000000000000000  0004826e
       000000000000002e  0000000000000000           0     0     1
  [ 4] .symtab           SYMTAB           0000000000000000  00048190
       00000000000000c0  0000000000000018           5     4     8
  [ 5] .strtab           STRTAB           0000000000000000  00048250
       000000000000001e  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), l (large)
  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)

Tool tips

About static libraries:

  • ar rcs libxxx.a xx1.o xx2.o: package object files into a static library
  • ar -t libc.a: list the object files in a static library
  • ar -x libc.a: extract all object files into the current directory
  • gcc --verbose: show every step of compilation and linking

About objdump:

  • objdump -i: list the available architectures and object formats
  • objdump -f: display file header information
  • objdump -d: disassemble the executable sections
  • objdump -t: display the symbol table entries, showing which symbols each object file has
  • objdump -r: display the relocation entries (the relocation table)
  • objdump -x: display all available header information, equivalent to -a -f -h -p -r -t
  • objdump -H: show help

About analyzing the ELF file format:

  • readelf -h: display the file header
  • readelf -S: display the section headers
  • readelf -r: display the relocation table
  • readelf -d: display the dynamic section

About viewing the symbols of an object file:

  • nm -a: display all symbols, including debugger-only ones
  • nm -D: display dynamic symbols
  • nm -u: display only undefined symbols
  • nm --defined-only: display only defined symbols

Explanation of symbols:

A lowercase symbol type indicates a local symbol; an uppercase type indicates a global (external) symbol.

  • A: The symbol's value is absolute and will not change during further linking. Such values often appear in interrupt vector tables, for example symbols that mark the position of each interrupt handler in the table.
  • B: The symbol is in the .bss section, which holds uninitialized global and static variables.
  • C: The symbol is in the COMMON section (uninitialized data). Its symbols behave as weak symbols: several object files may define the same name, and the linker merges them.
  • D: The symbol is in the initialized data section.
  • I: The symbol is an indirect reference to another symbol.
  • N: The symbol is a debugging symbol.
  • R: The symbol is in a read-only data section.
  • T: The symbol is in the text (code) section.
  • U: The symbol is undefined in the current file and must be defined in another file.
  • ?: The symbol type is unknown.
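To connect the letters to source code, the sketch below annotates each declaration with the type nm typically reports for an unoptimized GCC build on Linux. The names are hypothetical, and the exact letters can differ with compiler version and options.

```c
#include <stdio.h>

int        g_init = 1;    /* D: initialized global in .data */
int        g_zero;        /* B: uninitialized global in .bss (C when built with -fcommon) */
static int s_init = 2;    /* d: local (static) symbol in .data */
static int s_zero;        /* b: local (static) symbol in .bss */
const int  g_const = 3;   /* R: read-only data in .rodata */

/* t: local function in .text */
static int helper(void)
{
    return s_init + s_zero;
}

/* T: global function in .text */
int report(void)
{
    int sum = g_init + g_zero + g_const + helper();
    printf("sum = %d\n", sum);   /* leaves an undefined reference: U printf */
    return sum;
}
```

Compiling with gcc -c and running nm on the object file should show these letters next to the corresponding names.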

References

https://linuxtools-rst.readth...

"Self-cultivation of programmers"