Wednesday, October 19, 2011

Java Class File Layout and Structure

There are 10 basic sections to the Java Class File structure:

  • Magic Number: 0xCAFEBABE
  • Version of Class File Format: the minor and major versions of the class file
  • Constant Pool: Pool of constants for the class
  • Access Flags: for example whether the class is abstract, static, etc.
  • This Class: The name of the current class
  • Super Class: The name of the super class
  • Interfaces: Any interfaces in the class
  • Fields: Any fields in the class
  • Methods: Any methods in the class
  • Attributes: Any attributes of the class (for example the name of the sourcefile, etc.)

There is a handy mnemonic for remembering these 10: My Very Cute Animal Turns Savage In Full Moon Areas.

Magic, Version, Constant, Access, This, Super, Interfaces, Fields, Methods, Attributes (MVCATSIFMA)

Magic Numbers in Files

Detecting such constants in files is a simple and effective way of distinguishing between many file formats and can yield further run-time information.

Examples
  • Compiled Java class files (bytecode) start with hex CAFEBABE. When compressed with Pack200 the bytes are changed to CAFED00D.
  • GIF image files have the ASCII code for "GIF89a" (47 49 46 38 39 61) or "GIF87a" (47 49 46 38 37 61)
  • JPEG image files begin with FF D8 and end with FF D9. JPEG/JFIF files contain the ASCII code for "JFIF" (4A 46 49 46) as a null terminated string. JPEG/Exif files contain the ASCII code for "Exif" (45 78 69 66) also as a null terminated string, followed by more metadata about the file.
  • PNG image files begin with an 8-byte signature which identifies the file as a PNG file and allows detection of common file transfer problems: \211 P N G \r \n \032 \n (89 50 4E 47 0D 0A 1A 0A). That signature contains various newline characters to permit detecting unwarranted automated newline conversions, such as transferring the file using FTP with the ASCII transfer mode instead of the binary mode.
  • Standard MIDI music files have the ASCII code for "MThd" (4D 54 68 64) followed by more metadata.
  • Unix script files usually start with a shebang, "#!" (23 21) followed by the path to an interpreter.
  • PostScript files and programs start with "%!" (25 21).
  • PDF files start with "%PDF" (hex 25 50 44 46).
  • MS-DOS EXE files and the EXE stub of the Microsoft Windows PE (Portable Executable) files start with the characters "MZ" (4D 5A), the initials of the designer of the file format, Mark Zbikowski. The definition allows "ZM" (5A 4D) as well, but this is quite uncommon.
  • Amiga's black screen of death called Guru Meditation, in its first version, when the machine hung up for uncertain reasons, showed the hexadecimal number 48454C50, which stands for "HELP" in hexadecimal ASCII characters (48=H, 45=E, 4C=L, 50=P).
  • TIFF files begin with either II or MM followed by 42 as a two-byte integer in little or big endian byte ordering. II is for Intel, which uses little endian byte ordering, so the magic number is 49 49 2A 00. MM is for Motorola, which uses big endian byte ordering, so the magic number is 4D 4D 00 2A.
  • Unicode text files encoded in UTF-16 often start with the Byte Order Mark to detect endianness (FE FF for big endian and FF FE for little endian). UTF-8 text files often start with the UTF-8 encoding of the same character, EF BB BF.
  • LLVM Bitcode files start with BC (0x42, 0x43)
  • Microsoft Office document files start with D0 CF 11 E0, which is visually suggestive of the word "DOCFILE0".
  • Headers in ZIP files begin with "PK" (50 4B), the initials of Phil Katz, author of DOS compression utility PKZIP.