---
template=post
title=Statistics on Linux with /proc
style=/styles/post.css
style=writing.css

published=2025-03-02 4:00am CST

description=I want to tell you how my statistic gifs are made :)
---

<style>
	.manlink {
		margin-top: -1rem;
	}
</style>

I've been wanting to make a little page for the statistics of my
webserver <i>(the system, not the program)</i>. When I started to
research the APIs I'd need, just on a whim one day with no intention
of actually starting, the idea grabbed me and I knew I had to begin.

Check it out: <a href="/starlight.html">starlight.html</a>

<h2>a <code>/proc</code> foreword</h2>
The <code>/proc</code> filesystem, on Linux, is a sort of window into
the kernel. It lets you view some pretty detailed information by simply
reading some files (thanks everything-is-a-file linux).

There's a lot of information about it in the man pages.
They might all be in one big page at <code>man proc</code> or,
like they are on my server, broken into separate pages for the
distinct sections.

I've linked the relevant page at the top of each section. The links
go to man7.org, which seems to be <i>the</i> source for Linux kernel
man pages on the internet. man7 is linked from kernel.org, which lends
it some credibility at least.

<h2>Memory</h2>

<p class="manlink"><a href="https://man7.org/linux/man-pages/man5/proc_meminfo.5.html">man7.org/proc_meminfo</a></p>

This one isn't too hard. I open the file <code>/proc/meminfo</code> and
look for the lines starting with <code>MemTotal</code> and <code>MemAvailable</code>,
which are the total memory and the currently available memory, respectively. They
are very well named :). For usage, I just subtract available from total.
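
In Rust (the language the server's written in), a minimal sketch of that
might look like this; the same shape as my code, not the code itself:

<pre><code>use std::fs;

// Pull the number out of a line like "MemTotal:  1009576 kB".
fn kb_value(line: &amp;str) -&gt; Option&lt;u64&gt; {
    line.split_whitespace().nth(1)?.parse().ok()
}

fn main() -&gt; std::io::Result&lt;()&gt; {
    let meminfo = fs::read_to_string("/proc/meminfo")?;
    let (mut total, mut available) = (0u64, 0u64);
    for line in meminfo.lines() {
        if line.starts_with("MemTotal:") {
            total = kb_value(line).unwrap_or(0);
        } else if line.starts_with("MemAvailable:") {
            available = kb_value(line).unwrap_or(0);
        }
    }
    println!("{} kB used of {} kB", total - available, total);
    Ok(())
}</code></pre>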

<h2>Network</h2>

<p class="manlink"><a href="https://man7.org/linux/man-pages/man5/proc_net.5.html">man7.org/proc_net</a></p>

If you <code>cat /proc/net/dev</code> you can see some stats about
your networking interfaces. This is what I parse, with some pain.

I read the bytes columns from the receive and transmit sections.
These are running totals of bytes since boot, so you'll
have to take two samples and subtract to get the number of bytes
in some time span.

Looking at it in the terminal, you might assume that the separator
between the columns was a tab character. I sure did! It is not a tab,
but many spaces.

Because of spaces-and-not-tabs
<i>(not the usual tabs vs. spaces debate, but with similarities)</i>, it proved
to be a bit annoying to parse. It finally made me
pull in a regex crate, because I didn't feel like dealing with it
at the time. Eventually&trade; I want to write a skip-arbitrarily-many-whitespace
iterator, but for now <code>regex-lite</code> lives in my <code>Cargo.toml</code>.
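
For the curious, a parse that leans on <code>split_whitespace</code>
(which already skips runs of whitespace) can look something like this.
A sketch, not the regex-lite version I actually run; the colon-splitting
guards against the interface name and first counter reportedly fusing
together when the numbers get wide:

<pre><code>use std::fs;

fn main() -&gt; std::io::Result&lt;()&gt; {
    let dev = fs::read_to_string("/proc/net/dev")?;
    // The first two lines are headers; every line after is an interface.
    for line in dev.lines().skip(2) {
        // Split on the colon first, in case the name touches the numbers.
        let Some((iface, rest)) = line.split_once(':') else { continue };
        // split_whitespace eats any run of spaces for free.
        let cols: Vec&lt;u64&gt; = rest
            .split_whitespace()
            .filter_map(|c| c.parse().ok())
            .collect();
        if cols.len() &gt;= 9 {
            // Column 0 is receive bytes; column 8 is transmit bytes.
            println!("{}: rx {} tx {}", iface.trim(), cols[0], cols[8]);
        }
    }
    Ok(())
}</code></pre>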

<h2>CPU</h2>

<p class="manlink"><a href="https://man7.org/linux/man-pages/man5/proc_stat.5.html">man7.org/proc_stat</a></p>

<code>/proc/stat</code> is the least obvious of the triplet. It has more than
just the CPU's information, but the CPU is what we're after. You'll probably
notice many CPU lines! I'm using the one that starts with just "cpu", no number
(cpu0, cpu1, etc.), because I only have the 1 core. With more cores
it'd work similarly; the plain-cpu line sums the per-core ones, but then it could
show >100% usage 'cause it's per-core usage just added together.

First things uh, second? To summarize from the man page:<br />
The units of these values are <i>ticks</i>. There are <code>USER_HZ</code>
ticks per second. On most platforms it's 100, but you can
check the value for your system with <code>sysconf(_SC_CLK_TCK)</code>.

<details>
	<summary style="font-style: italic;">small C program to check _SC_CLK_TCK :)</summary>
	<pre><code>#include &lt;stdio.h&gt;
#include &lt;unistd.h&gt;

int main(void) {
	/* sysconf() returns a long, hence %ld */
	printf("USER_HZ is %ld\n", sysconf(_SC_CLK_TCK));
}</code></pre>
</details>

But which columns of data do we use? From <a href="https://stackoverflow.com/a/3017438">this Stack Overflow answer</a>
it seems that summing the user, nice, and system columns gets you the total
busy ticks. User and system make sense to me, time spent in user and system mode,
but what on earth is nice? I sure hope it is.

The Internet tells me to check <code>man nice</code>
(<a href="https://man7.org/linux/man-pages/man1/nice.1.html">man7.org/nice</a>).
That page says that the
niceness of a process can be adjusted to change how the kernel schedules
that process. Making it less nice (down to -20) increases its priority, and
increasing its niceness (up to 19) lowers it. I guess that makes sense. Lowering
the niceness makes the process greedier and in want of more attention
from the scheduler? I'm unsure how well that personification tracks reality, but
it helped me think about it.

The nice column, then, seems to be the time spent in processes that
would go in the user column, but they have a different priority and
I guess differentiating that is important.

Oh, but there might be more columns we want!
There's <a href="https://stackoverflow.com/a/10794088">another S.O. answer</a>
that I found while writing this that says the sixth and seventh columns should be
used as well. These are irq and softirq: time spent servicing
interrupts. I think it makes sense that we'd want those, too.

So you have all these columns&mdash;user, nice, system, irq,
and softirq&mdash;that add together to give you the total number
of ticks spent Doing Things since boot, and you have the number
of ticks in a second. Can you see where I'm going with this?

Yup, take two samples some time span apart, subtract the former
from the latter, and then you have how much time the processor spent
Doing Things. You can use that and the number of ticks in your time
span to calculate utilization. Or you just have how much actual time
The Computer spent Doing Work, which is also pretty neat. Maybe you
can pay it an hourly wage. Is that just AWS?
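
Put together, the whole dance might look like this. A sketch that
hardcodes <code>USER_HZ</code> to 100, which is an assumption; check
<code>sysconf</code> on your machine:

<pre><code>use std::{fs, thread, time::Duration};

// Sum the "busy" columns of the plain-cpu line: user, nice, system,
// plus irq and softirq (the sixth and seventh columns).
fn busy_ticks() -&gt; Option&lt;u64&gt; {
    let stat = fs::read_to_string("/proc/stat").ok()?;
    let line = stat.lines().find(|l| l.starts_with("cpu "))?;
    let cols: Vec&lt;u64&gt; = line
        .split_whitespace()
        .skip(1) // past the "cpu" label
        .filter_map(|c| c.parse().ok())
        .collect();
    // cols is [user, nice, system, idle, iowait, irq, softirq, ..]
    if cols.len() &lt; 7 {
        return None;
    }
    Some(cols[0] + cols[1] + cols[2] + cols[5] + cols[6])
}

fn main() {
    const USER_HZ: u64 = 100; // assumed! check sysconf(_SC_CLK_TCK)
    let span = 5;
    let before = busy_ticks().expect("couldn't read /proc/stat");
    thread::sleep(Duration::from_secs(span));
    let after = busy_ticks().expect("couldn't read /proc/stat");
    // Ticks spent Doing Things over the ticks available in the span.
    // (On one core; with more, this can add up past 100%.)
    let usage = (after - before) as f64 / (USER_HZ * span) as f64;
    println!("cpu usage: {:.1}%", usage * 100.0);
}</code></pre>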

Something to watch out for:<br />
apparently the numbers in <code>/proc/stat</code> can overflow and
wrap back to zero. I don't know what size integers they are, so I'm
unsure how real a risk that is, but it seemed worth mentioning here.
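
If the counters are the same width as the integer you parse them into
(64 bits, say; purely an assumption on my part), a wrapping subtraction
would at least survive a single wrap:

<pre><code>// Assuming the kernel's counter is also 64 bits wide, wrapping
// subtraction still gives the right delta across one wrap to zero.
fn delta(before: u64, after: u64) -&gt; u64 {
    after.wrapping_sub(before)
}

fn main() {
    // A counter that was near the top and wrapped around to 7.
    assert_eq!(delta(u64::MAX - 2, 7), 10);
    println!("wrap-safe delta works");
}</code></pre>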

<h2>So you've parsed the stats, now to graphs!</h2>

My main trouble here was selecting a range that makes sense for
the data it's representing.

Again, memory was easy. There is a
total, normally-unchanging amount of RAM, so I just use that as
the max. Perhaps there's something to be said for zooming further
in to see the megabyte-by-megabyte variance, but I am much more
interested in a "how close am I to the ceiling" kind of graph. Like,
would I hit my head if I jumped? That kind of thing.

The CPU graph, though, is very variable and a bit spiky.
I don't <i>really</i> care what the max value was if it's a spike;
it can go off the top for all I care. What I want to see is the
typical usage.

If I just ranged to the max then I'd have what I call The Linode
Problem. I call it that, rather predictably, because that's what
Linode's graphs do and it makes them kind of useless? Great, I love
to see that spike up to 100%, but that's <i>all</i> that I can see now.

So instead of max-grabbing, I sort the data and take the value that's
<i>almost</i> the max. My series are 256 samples long, so in practice this
meant taking the 240th value in the sorted array, rounding it up to the
nearest percent, and using that as the top of the range.

This <i>does</i> mean if it's <i>very</i> spiky, I get The Linode Problem
again, but in that case I'm kind of okay with it. I sample every minute,
so my 256-pixel-long graphs are roughly 4 hours long. If it spikes more
than 16 times in that period, perhaps that's worth looking into.

Okay, CPU done. Network time! It's pretty much the same. Where there was
one line, there are now two. And lots more spikes! I combine the receive
and transmit series into one <code>vec</code>, sort it, and take the 32nd
highest value.
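
In code, the range-picking is something like this; a sketch of the idea
with made-up sample data:

<pre><code>// Sort a copy of the series and take the value `spikes` slots down
// from the top, so that many samples can fly off the chart without
// stretching the whole range. Assumes a non-empty series.
fn range_top(samples: &amp;[u64], spikes: usize) -&gt; u64 {
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    sorted[sorted.len().saturating_sub(spikes + 1)]
}

fn main() {
    // 256 fake cpu samples with the occasional spike to 100.
    let cpu: Vec&lt;u64&gt; = (0..256u64)
        .map(|i| if i % 50 == 0 { 100 } else { i % 30 })
        .collect();
    // 16 slots down from the top of 256 samples: the 240th value.
    println!("cpu range top: {}", range_top(&amp;cpu, 16));

    // For the network graph, both series get sorted together
    // (and the real thing takes the 32nd highest of the 512).
    let rx = vec![120u64, 80, 4000, 95, 110];
    let tx = vec![40u64, 55, 60, 9000, 50];
    let mut both = rx.clone();
    both.extend_from_slice(&amp;tx);
    println!("net range top: {}", range_top(&amp;both, 2));
}</code></pre>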

I draw the area under the line, too, because it was nigh impossible to see
the line when it was so... discontinuous? That gives us another problem,
though, where the second-drawn line-and-underfill will obscure the one
drawn first. So, to avoid overdrawing an entire measurement, I try to draw
the on-average-larger one first. Which is to say, I take the average of both
series separately and draw the one with the bigger average first. That way
the smaller one will hopefully nestle under the larger, like a baby bird
hiding from the rain under their parent's wing.
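
The ordering decision itself is tiny; something like this sketch (with
made-up numbers, and a println standing in for the actual drawing):

<pre><code>// The series with the larger average gets drawn (and filled) first,
// so its fill can't hide the other one.
fn average(series: &amp;[u64]) -&gt; f64 {
    series.iter().sum::&lt;u64&gt;() as f64 / series.len() as f64
}

fn main() {
    let rx = vec![1200u64, 900, 1100, 950];
    let tx = vec![300u64, 250, 400, 310];
    let order = if average(&amp;rx) &gt;= average(&amp;tx) {
        ["rx", "tx"]
    } else {
        ["tx", "rx"]
    };
    // Stand-in for actually drawing the two gif layers.
    println!("draw {} first, then {}", order[0], order[1]);
}</code></pre>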

<hr class="asterism-dash" />

That's how the range selection works, anyway.

The graphs themselves are drawn as 256x160 gifs because I like gif, 256 is
a good number, and gif seems to compress better than png for this use case.

One day I'd love to try and generate alternative text to describe
the general look of the graph. "The memory usage is steady at 300MB",
or something like "The network usage is variable, but averages 15.4kbps".

That's it!<br />
bye :)