Review
Goal: Distinguish between different integer encodings.
| Review | Newish |
|---|---|
| SHA | stdint |
| Bits | fileio |
| Arrays | htonll |
| Printb | memset |
- There are no required exercises for this lab.
- It is supplementary material to the SHAinC homework.
- Option and htonll are the major optional exercises that support SHAinC.
  - They are what I regard as likely to be the optimal solutions to problems I expect you to encounter.
- Memset might, or might not, help a lot (depending on how you think).
- Arrays will very likely help you if you encounter a segmentation fault.
Podman
Setup
- For this lab, I used the following Containerfile
- Same as the C89/99 lecture
Containerfile
FROM ubuntu
RUN apt update && apt install gcc vim -y
- I built via the following:
podman build -t c89_99 .
- I conducted the full lab within a single container’s vim instance.
podman run -it printb vim endian.c
- I conducted all work within this vim window and its :term
stdint
stdint.h
- Before doing anything else, we should get 32 bit words.
- I use stdint.h.
- Change last week’s homework to:
  - Add an include, and
  - Remove unsigned by converting to uint32_t.
- Say I had this (incorrect) code:
macros.c
#define CHOICE(e,f,g) ((e)?(f):(g))
#define MEDIAN(e,f,g) ((!!(e) + !!(f) + !!(g)) > 1)
#define ROTATE(a,n) ((a) >> (n))
int main() {
    unsigned e = 0xAA, f = 0x55, g = 0x66;
    CHOICE(e,g,g);
    MEDIAN(e,f,g);
    ROTATE(e,f);
    return 0;
}
- Include
stdint
and update type names:
macros.c
#include <stdint.h> /* !!! NEW */
#define CHOICE(e,f,g) ((e)?(f):(g))
#define MEDIAN(e,f,g) ((!!(e) + !!(f) + !!(g)) > 1)
#define ROTATE(a,n) ((a) >> (n))
int main() {
    uint32_t e = 0xAA, f = 0x55, g = 0x66; /* !!! NEW */
    CHOICE(e,g,g);
    MEDIAN(e,f,g);
    ROTATE(e,f);
    return 0;
}
An astute student may note that one obvious way to make such a change is to:
- Open Python
- Read in macros.c via open into a string variable.
- Close the file descriptor to macros.c
- Perform a replace from unsigned to uint32_t.
- Write the file to macros.c
While in this case we are performing a single string substitution, and Python is hardly the best choice for so simple a transform, it is always appropriate to use scripts, rather than manual methods, to write and edit code.
- You also need to use exactly one 64 bit numerical value, l
  - The message length in bits.
  - We recall the SHA-2 standard caps SHA-256 messages at length 2 ** 64 - 1 bits.
  - Usually (almost always) a multiple of 8.
  - Rarely needs all 64 bits, but is required to have them for the standard.
- Other values - hash values, message chunks, etc. - are 32 bit.
macros.c
#include <stdint.h> /* !!! NEW */
int main() {
uint32_t h_i[0x08]; /* 256 bit hash */
uint32_t m_i[0x10]; /* 512 bit message chunk */
uint64_t l; /* 64 bit message length */
/* Etc, etc. */
return 0;
}
- We will use these variables to read information from a file.
FileIO
I delivered remarks on file io in a previous term using these slides.
File I/O
- We uphold the following exemplar of SHA256:
sha256sum
$ echo "15 characters." > 15char.txt
$ wc 15char.txt
1 2 15 15char.txt
$ cat 15char.txt
15 characters.
$ sha256sum 15char.txt
5794032f0c0c7ec2c1f43ac9500f65076ad65ec45b8f76e7e2e4cf882b55c3bb 15char.txt
- It computes an SHA-256 hash based on a filename.
- We introduce file input (not really output) for equivalence.
- fopen()
- fread()
- fprintf()
- fclose()
fopen()
- C fopen is almost identical to Python open
  - Take a filename, and
  - A mode, and
  - Return a “file pointer”
    - In Python, an _io.TextIOWrapper
>>> open("lipsum.txt", "r")
<_io.TextIOWrapper name='lipsum.txt' mode='r' encoding='cp1252'>
    - In C, a FILE *
FILE *fopen(const char *pathname, const char *mode);
- We capture the return value in a variable for use with other functions.
An astute student may note that while the Python script can be run directly, testing a line of C code requires additional supporting lines-of-code (LoC) in a complete .c file with:
- An #include preprocessor directive to incorporate the stdio library
- A main function, likely with a return 0;
While in this case we are not using this line of code to do anything, versus single lines of Python, single lines of C code are not necessarily independently interpretable or testable.
Null Checking
- In C, and perhaps in Python, it is common to check if an fopen call is successful.
  - fopen is something called a “system call” like printf, and should be “null checked”.
- We:
  - Capture the return value in a variable, and
  - Compare the return value to NULL (that is, zero), and
  - If the return value is NULL, exit with an error code.
    - Usually in C this is exit(1).
    - Included in “stdlib.h”
- Here is an example:
  - We note that “f_name” is of type char * and includes the file extension.
#include <stdio.h>  /* fopen, FILE */
#include <stdlib.h> /* exit */
int main() {
    char *f_name = "my_file.txt";
    FILE *fp = fopen(f_name, "r"); /* read mode */
    if (fp == NULL) {
        exit(1);
    }
    return 0;
}
- You are under no real obligation to null check but…
- Without null checks, very bad times at low probability.
- Ground truth for return values is from the “man pages”
  - Or use ‘man fopen’ in a non-minimized (not podman) Linux system.
fread()
- Once you have opened a file and null checked, it may be read.
- As a rule, read files into character arrays.
- Not quite strings, but not quite not strings.
- For SHA-256, read 512 bits at a time.
  - That is 64 characters/bytes/uint8_ts
uint8_t bytes[64];
- C fread is quite distinct from Python .read()
  - Python .read is object oriented (not C-like)
- fread takes four arguments:
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream)
- A memory location ptr into which to read bytes.
  - Call it the “dst” or “destination”, perhaps.
- A size_t “size”, how big each thing to read is.
  - size_t is usually a uint64_t, but doesn’t have to be.
  - Would be of size 1 for char
  - Would be of size 8 for uint64_t
  - Would be of size sizeof(int) for the default int type.
- A size_t ‘nmemb’ for “\(n\) members”
  - The number of things of ‘size’ to read.
- A FILE * ‘stream’ from which to read.
- fread will read ‘size’ * ‘nmemb’ bytes from ‘stream’ into ‘ptr’.
#include <stdio.h>  /* fopen, fread */
#include <stdint.h> /* uint8_t */
#include <stdlib.h> /* exit */
int main() {
    uint8_t bytes[64];
    /* Unsafe - needs nullchecks */
    FILE *fp = fopen("my_file.txt", "r"); /* read mode */
    fread(bytes, 1, 64, fp); /* read up to 64 * 8 -> 512 bits */
    return 0;
}
~Null checking
- fread has a variety of interesting return values.
  - It will return the number of members read…
  - Not necessarily the number of bytes (if ‘size’ is non-one)
  - Returns zero on error, but also…
- Not all files are as large as the read buffer.
  - In the above example, bytes is the buffer into which we read - the read buffer.
- On SHA-256, you will have to read files of arbitrary size.
  - They will not all be multiples of 512 bits.
  - You will have to follow the padding algorithm.
- Checking the return value of fread will be necessary to determine l.
- Make sure you check sizeof(size_t)!
size_t l = 0; /* SHA-256 length variable 'l' */
l += fread(bytes, 1, 64, fp) << 3; /* 2^3 bits per byte */
fprintf
- fprintf is like printf, but the leading f allows directing output somewhere other than standard output, stdout.
- I use it here to direct error messages to a special place, stderr
  - stderr is easier to work with in a variety of complicated ways, but…
  - It is “unbuffered”, so print statements will be output to stderr immediately.
  - Sometimes, printf statements are lost if errors occur in immediately successive lines of code.
  - We can also capture stderr output specifically with shell commands:
$ ./endian 2> /dev/null
- I use fprintf to write error messages within null checks.
- Here is one example, from my sha256 reference solution:
FILE *fp = fopen(f_name, "r"); /* read mode */
if (fp == NULL) {
    fprintf(stderr, "fopen fails on f_name \"%s\", exiting...\n", f_name);
    exit(1);
}
- High reward and no risk.
- I will use 2> /dev/null in autograders for your convenience.
fclose
- I am contractually obligated to tell you to close files.
- The interesting bit here is the special EOF character.
  - Read more.
  - I refer to EOF a lot in my theory courses, as a “special character” that comes up a lot in automata and computability theory.
- Sample code:
if (fclose(fp) == EOF) {
    fprintf(stderr, "fclose fails on f_name \"%s\", exiting...\n", f_name);
    exit(1);
}
Option
Optional Exercise
- Write a C language executable that
- Given a file name at command line…
- Prints out the contents of the file…
- In hexadecimal…
- In chunks of size 512 bits (64 bytes).
- Sample input:
lipsum.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
- Sample command:
./fileio lipsum.txt
- Sample output:
4c6f7265 6d206970 73756d20 646f6c6f
72207369 7420616d 65742c20 636f6e73
65637465 74757220 61646970 69736369
6e672065 6c69742c 20736564 20646f20
65697573 6d6f6420 74656d70 6f722069
6e636964 6964756e 74207574 206c6162
6f726520 65742064 6f6c6f72 65206d61
676e6120 616c6971 75612e20 55742065
6e696d20 6164206d 696e696d 2076656e
69616d2c 20717569 73206e6f 73747275
64206578 65726369 74617469 6f6e2075
6c6c616d 636f206c 61626f72 6973206e
69736920 75742061 6c697175 69702065
78206561 20636f6d 6d6f646f 20636f6e
73657175 61742e20 44756973 20617574
65206972 75726520 646f6c6f 7220696e
20726570 72656865 6e646572 69742069
6e20766f 6c757074 61746520 76656c69
74206573 73652063 696c6c75 6d20646f
6c6f7265 20657520 66756769 6174206e
756c6c61 20706172 69617475 722e2045
78636570 74657572 2073696e 74206f63
63616563 61742063 75706964 61746174
206e6f6e 2070726f 6964656e 742c2073
756e7420 696e2063 756c7061 20717569
206f6666 69636961 20646573 6572756e
74206d6f 6c6c6974 20616e69 6d206964
20657374 206c6162 6f72756d 2e0a
- It is a simple matter to verify correctness.
  - Take the first hexadecimal byte - 0x4c.
  - Use e.g. Python chr.
  - Get the letter L.
chr(0x4c)
- Solution (spoilers):
fileio.c
/*
* fileio.c
*
* Write a C language executable that given a file name
* prints out the contents of the file in hexadecimal
* in chunks of size 512 bits (64 bytes).
*/
/*
gcc fileio.c --std=c89 -Wall -Wextra -Werror -Wpedantic -O2 -o fileio
*/
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
    size_t i, l = 0;
    uint8_t m[64];
    FILE *fp;
    if (!(argc > 1)) {
        fprintf(stderr, "No filename argument provided, exiting...\n");
        exit(1);
    }
    fp = fopen(argv[1], "r"); /* read mode */
    if (fp == NULL) {
        fprintf(stderr, "fopen fails on argv[1] \"%s\", exiting...\n", argv[1]);
        exit(1);
    }
    do {
        l = fread(m, 1, 64, fp);
        for (i = 0; i < l; i++) {
            if (i % 4 == 0 && i > 0) {
                printf(" ");
            }
            if (i % 16 == 0 && i > 0) {
                printf("\n");
            }
            printf("%02x", m[i]);
        }
        printf("\n\n");
    } while (l == 64);
    if (fclose(fp) == EOF) {
        fprintf(stderr, "fclose fails on argv[1] \"%s\", exiting...\n", argv[1]);
        exit(1);
    }
    return 0;
}
An astute student may note that the “do… while” formulation is unusual.
- In this, we will always print something.
- The empty file (a file of size zero) and no file (the absence of a file of a given name) are distinct.
- Therefore, we always read from the file at least once after null-checking.
- However, we only read again if a full 512 bits (64 bytes) were read…
- Otherwise we have read all the content of the file.
The “do… while” formulation is uncommon, but makes this specific use case much easier. Where else might you use a “do… while”? Why?
astyle
Optional Utility
- You may find keeping your C code formatted in vim frustrating.
- Simply exit vim and invoke astyle.
astyle fileio.c
- It is likely not on your system by default.
  - Add to the Containerfile and rebuild, or
apt install astyle # this assumes ubuntu, not alpine
- There are other such utilities, but this one is quick and nice.
- If you are using a build script, you can simply add astyle.
Endian
Endianness is a challenge topic. Be systematic, patient, and indefatigable.
Endianness
- For an out-of-scope reason, most computers store bits differently than most networks.
- SHA-256 analyzes bits on a computer that are slated to go out on a network.
- This topic is called “Endianness”
- We go live to a lightly-formatted LLM generated .md write-up:
LLMs on Endianness
- Endianness refers to the byte order used to store multi-byte data types in memory.
  - Like int, unsigned, and size_t.
- There are two main types:
- Little-Endian
  - Definition: The least significant byte (LSB) is stored at the lowest memory address.
  - Example: The number 0x1234ABCD is stored in memory as:
CDAB3412
  - Editor’s Note: This is very hard to check, I might add.
- Big-Endian
  - Definition: The most significant byte (MSB) is stored at the lowest memory address.
  - Example: The number 0x1234ABCD is stored in memory as:
1234ABCD
An astute student may note that CDAB3412 does not appear to be the reversal of 1234ABCD
- Endianness reverses specifically bytes, 8 bits of information.
- Hexadecimal notation expresses specifically nibs or half-bytes, 4 bits of information.
  - D is a nib and CD is a byte.
  - ‘1234’ in reverse byte order is 3412 and is !!!NOT!!! 4321
  - 4321 would constitute a nib or half-byte reversal.
I recommend thinking of every pair of hexadecimal characters as a digit in a base 0x100 (base 256) expression of some numerical value.
- Why Endianness Matters
- Data Interoperability: Different systems may use different endianness, causing issues when sharing binary data.
- Networking: The Internet Protocol (IP) uses big-endian (network byte order).
- Editors Note: SHA-256 in practice uses big-endian.
- Detecting Endianness in C
- You can determine the endianness of your system using the following code:
checke.c
#include <stdio.h>
int main() {
    unsigned int x = 1;
    char *c = (char *)&x;
    if (*c) {
        printf("Little-Endian\n");
    } else {
        printf("Big-Endian\n");
    }
    return 0;
}
Why it matters
- To my knowledge, all reference SHA-256 solutions use big endian.
- My physical device uses little endian, and I suspect yours does too.
arpa/inet.h
- This was a big problem in the early internet days.
- ARPA, now DARPA, more or less launched the modern internet.
- They are responsible for the C89 eligible header arpa/inet.h
  - It contains e.g. htonl
    - “Host to network long”
    - In 1989, uint32_t was commonly called a (unsigned) long.
    - Endianness and signedness are non-interactive.
    - Recall the Encode lecture on signedness.
- In 2024, often use htobe32
  - “Host to big endian 32bit”
  - From C99 endian.h
  - As with %b, you may use to debug but not to solve your homework.
    - That said, it is better to turn in a solution using endian.h than no solution.
    - This is an exception to the typical “no partial credit” policy.
endian.c
/* endian.c - ∃ C99+ <endian.h> be advised */
/*
gcc endian.c -O2 -o endian
*/
#include <stdio.h> /* fileio */
#include <stdint.h> /* uint32_t */
/* Optionally - endianness compatibility. */
#include <arpa/inet.h> /* htonl */
#include <endian.h> /* htobe32 */
uint32_t my_htonl(uint32_t n) {
    return n;
}
int main() {
    uint32_t test = 0x1234ABCD;
    printf("%08X\n%08X\n%08X\n%08X\n",
           test,
           htonl(test),
           htobe32(test),
           my_htonl(test));
    return 0;
}
- We will now step through an example solution to my_htonl.
- We will leave as an exercise htonll - the 64 bit variant.
- You can compare against the htobe64 from endian.h
  - Not using htobe64 is not just a learning goal.
  - Removing endian.h from my code sped it up 5x in testing.
- I note I will assume the “host” is little endian.
  - It is trivial to check, and you may do so, or use a #define
Memory
- The core insight of endianness is that it refers to bits in memory.
- Recall we have used the notion of memory a few times:
- Strings in C are characters in adjacent memory locations.
- Bitwise operations act on bits in adjacent memory locations.
  - The star * operator and array notation [n] are similar.
- We will rearrange bits using a novel operator - &, which inverts *.
  - We note that unary prefix & is a memory operator, distinct from
  - Binary infix & which is a bitwise operator.
Initial State
- We begin with the identity operation on uint32_t.
uint32_t my_htonl(uint32_t n) {
return n;
}
Variables
- In C89 style, we first declare variables.
- First is an alias for n.
  - We will treat the 32 bit int as an array of 8 bit ints.
  - An 8 bit int is of type uint8_t
  - An array is of type *
- We initialize it to no value, for now.
uint32_t my_htonl(uint32_t n) {
uint8_t *alias;
return n;
}
- We will also be swapping bits around.
- In Python we can use multiple assignment
x, y = y, x
- This only works with key-value pairs, not arrays.
- We use a swapping variable, a la
t : int = x
x = y
y = t
del t
- In C:
uint32_t my_htonl(uint32_t n) {
uint8_t swap, *alias;
return n;
}
- We will be looping over something that is quite like an array, so we need an index variable.
- I create this variable of type size_t because:
  - We will compare it to a sizeof
  - We will use it as an array index to denote memory locations.
  - So the C implementation will guarantee size_t is appropriate for array indices.
  - You can determine what a size_t is using the usual methods for examining C code.
uint32_t my_htonl(uint32_t n) {
uint8_t swap, *alias;
size_t index;
return n;
}
Ampersand
- We will now use the ampersand & operator.
  - We want the uint8_t * array alias to refer to the same bits as the uint32_t numerical value n.
  - The location of the bits that constitute n in memory is given by &n.
  - So we set the alias equal to that location.
uint32_t my_htonl(uint32_t n) {
    uint8_t swap, *alias = &n;
    size_t index;
    return n;
}
- If you are confused at all, you should stop and print out, minimally:
  - The values of the star, ampersand, and unmodified alias and n variables.
  - Anything else you think of.
  - You should do these all in different lines in case you get an error.
- You should think of the &n as the key of n in the key-value memory storage of the system.
Casting
- When I run this code, I get this warning:
endian.c: In function ‘my_htonl’:
endian.c:15:26: warning: initialization of ‘uint8_t *’ {aka ‘unsigned char *’} from incompatible pointer type ‘uint32_t *’ {aka ‘unsigned int *’} [-Wincompatible-pointer-types]
15 | uint8_t swap, *arr = &n; |
- That’s a good warning, I am doing something extremely sketchy.
  - After all, I told gcc that n was a 32 bit integer.
  - Then I told it that the location n was in held an array of 8 bit integers.
  - Those things can’t both be true.
- We assure gcc that we know what we are doing with a cast
  - We apply a cast to a value, like the value of the memory location of n
  - A cast is a type name in parentheses as a prefix.
  - A cast assures the compiler of the intentionality of this mapping of bits to variable names.
  - We are claiming we want to regard n as an array of 8 bit integers, as a uint8_t *.
- So:
uint32_t my_htonl(uint32_t n) {
uint8_t swap, *alias = (uint8_t *)&n;
size_t index;
return n;
}
- This silenced all my gcc warnings and errors.
Main loop
- It now suffices to rearrange the bits within 32 bit integer n
- Helpfully, we can refer to these bits using the aliasing 8 bit integer array alias.
uint32_t my_htonl(uint32_t n) {
    uint8_t swap, *alias = (uint8_t *)&n;
    size_t index;
    swap = alias[0];
    alias[0] = alias[3];
    alias[3] = swap;
    swap = alias[1];
    alias[1] = alias[2];
    alias[2] = swap;
    return n;
}
- You will note this contains repeated code, and is therefore de facto wrong, though it runs correctly.
htonll
htonll
- Write a 64 bit endianness inverter.
- You should start with your my_htonl code if you are stuck.
  - Refactor the internals into a loop
  - Use sizeof for the loop termination.
- Advanced students may wish to write a size agnostic HTON macro.
  - This is non-trivial but probably possible (I didn’t check).
memset
memset
- The C89 “string.h” library contains a helpful function memset.
NAME
memset - fill memory with a constant byte
SYNOPSIS
#include <string.h>
void *memset(void *s, int c, size_t n);
DESCRIPTION
The memset() function fills the first n bytes of the
memory area pointed to by s with the constant byte
c.
RETURN VALUE
The memset() function returns a pointer to the memory area s.
- I use memset to zero out all my arrays before I use them.
- This usually doesn’t matter but is very nice with string data.
memcpy
- The C89 “string.h” library contains a helpful function memcpy.
  - I have never once in my life remembered destination is first, not source.
  - I spent an hour trying to figure out a bug caused by that.
NAME
memcpy - copy memory area
SYNOPSIS
#include <string.h>
void *memcpy(void *dest, const void *src, size_t n);
DESCRIPTION
The memcpy() function copies n bytes from memory area src to memory area dest. The memory
areas must not overlap. Use memmove(3) if the memory areas do overlap.
RETURN VALUE
The memcpy() function returns a pointer to dest.
ATTRIBUTES For an explanation of the terms used in this section, see attributes(7).
- I use memcpy to move data into and out of SHA internal state:
  - The working variables.
- The current \(H_i\) hash value.
- The current \(M_i\) message data.
Arrays
Arrays
- You will be tempted to write functions of this form:
/* Take two arrays and return a new array */
uint32_t *new_hash(uint32_t *m_i, uint32_t *h_i) {
    uint32_t h_i_1[8];
    /* some operations */
    return h_i_1;
}
- This will cause a segmentation fault for the following reason.
  - h_i_1 refers to a memory location.
  - That memory location is a local variable of the new_hash function
  - When new_hash returns, it no longer manages its local variables.
  - Therefore, it is no longer safe to access that memory location.
  - It is, essentially, reclaimed by the operating system.
  - This occurs probabilistically, but at high probability.
  - So the next effort to access a value in h_i_1 will trigger the OS to terminate your program.
- Here is the C alternative:
/* Take two arrays as arguments and update the second */
void new_hash(uint32_t *m_i, uint32_t *h_i) {
    uint32_t h_i_1[8];
    /* some operations */
    memcpy(h_i, h_i_1, sizeof(h_i_1));
    return;
}
- h_i is safe because it is managed by some other function.
  - Likely main or some other function that called new_hash.
- Therefore, it will not be reclaimed by the OS on return from new_hash.
Take-aways
- Don’t return arrays.
- Provide arrays as arguments and update the provided arrays.
- There are other ways to do this that we will learn in time.
- But they are unhelpful on SHA-256, and would enable low quality solutions.