How do I make sure my game runs well on Via?
While this is Windows-specific and written in the context of Via, it ended up being a more general guide to reading files.
Everyone has to read files at the end of the day.
One of Via's big advantages is that it conforms to the existing file system API. This means no special code or
functionality is required for games to run on it. That said, the system does stress the loading code a lot
more than a standard SSD or HDD would.
Device | Bandwidth | Latency |
---|---|---|
NVMe | 7 GB/s | 50 us |
SSD | 600 MB/s | 300 us |
HDD | 150 MB/s | 5 ms |
Via | 10-100 MB/s | 10-100 ms |
Via sits in a unique position: it doesn't depend on sequential access the way hard drives do, but players will likely
still see higher latency and lower bandwidth. It is further complicated by the fact that once content is downloaded it is cached,
and performance from then on depends on the local device speed.
This may come as a surprise, but I have analyzed tons of games at this point and bandwidth is almost never the limiting factor.
All of your attention should be spent on keeping the pipe full of useful data. If you can do that, your game will run great on
Via (and any other device).
Bandwidth | Latency | Min Outstanding |
---|---|---|
10 MB/s | 10 ms | 100 KB |
10 MB/s | 50 ms | 500 KB |
10 MB/s | 100 ms | 1 MB |
50 MB/s | 10 ms | 500 KB |
50 MB/s | 50 ms | 2.5 MB |
50 MB/s | 100 ms | 5 MB |
100 MB/s | 10 ms | 1 MB |
100 MB/s | 50 ms | 5 MB |
100 MB/s | 100 ms | 10 MB |
If you haven't heard of the bandwidth-delay product, it is worth reading up on, along with the related concept of IOPS.
Note that the bandwidth-delay product is not unique to networks; it is relevant in any system where throughput and latency are at play.
It is simply a calculation of how much data you need outstanding on the wire to keep the pipe full. If you undershoot it, you will have bubbles.
Assume, for example, that you only had 100 KB outstanding at a time on a 10 MB/s 50 ms connection. You'd only be getting an effective throughput
of 2 MB/s. That is a 5x reduction in speed. Your load screen just went from 3 minutes to 15.
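For reference, the min outstanding column in the table above is just bandwidth times latency. A quick sketch of the arithmetic (min_outstanding is a made-up helper, not a real API):

```c
// sketch: bandwidth-delay product, the bytes that must be in flight to
// keep the pipe full (bandwidth in bytes/s, latency in seconds)
u64 min_outstanding(u64 bandwidth, double latency) {
    return (u64)((double)bandwidth * latency);
}

// min_outstanding(10 * 1000 * 1000, 0.050) == 500000, the 500 KB row above
```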
An easy speed-of-light check for your game's performance is to compare how much data was downloaded before you were in game
and playing against how long that took. If your game needs 2 gigs before you are in game, the best possible time on a 10 MB/s connection is 3:20.
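The check itself is one division. A minimal sketch (all the values here are placeholders for whatever you measure):

```c
// sketch of the speed-of-light check: the best case load time is simply
// bytes downloaded divided by connection bandwidth
double bytes_downloaded = 2.0e9;  // e.g. 2 gigs before gameplay
double bandwidth        = 10.0e6; // e.g. a 10 MB/s connection
double best_case        = bytes_downloaded / bandwidth; // 200 s, or 3:20

// a measured load time far above best_case means you have bubbles
```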
This assumes all the data you downloaded was actually useful and necessary. Unlike a traditional up-front download and install, Via gives
developers control over what specifically is downloaded and when. If taken advantage of, games could rival the web in time-to-play.
Calling ReadFile()
Reading from a file, how hard could it be?
Given the discussion so far, it should be clear how critical this is to performance.
Consider the following example:
```c
void read_sync(HANDLE file, char* buffer, u64 beg, u32 len) {
    // the OVERLAPPED struct is also how you pass the file offset
    OVERLAPPED overlapped = {};
    overlapped.Pointer = (void*)beg;

    DWORD bytes_read;
    assert(ReadFile(file, buffer, len, &bytes_read, &overlapped));
    assert(bytes_read == len);
}
```
Nothing should be surprising here; this is the first thing most people type in and never think about again.
Here is the timeline for a 100 KB read on a 10 MB/s 50 ms connection:
- ReadFile() enters the kernel
- The kernel dispatches to the Via driver
- Via sends a request to the server
- Your thread is put to sleep
- The request reaches the server after 25 ms
- The server starts sending packets back at 10 MB/s
- The first packet is received after 25 ms
- The last packet is received after another 10 ms
- Via completes the request and wakes your thread
- Your thread returns from ReadFile() and continues
A critical observation here is that, no matter what you do, you will pay a full RTT bubble on every request.
Another observation is that the actual time spent transferring is only 10 ms. Latency dominates bandwidth if you aren't careful.
A simple step to improve effective throughput is to just increase your request size.
This stops being practical past a certain point and doesn't actually address the problem, but it at least helps.
As you increase the request size, you get more time spent in step 8, but you will always pay that full RTT bubble in steps 5-7 with sync io.
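Putting numbers on that: a single blocking read moves its bytes in one RTT plus one transfer time, matching the 60 ms timeline above. A sketch (effective_throughput is a made-up helper):

```c
// sketch: effective throughput of one blocking read of `len` bytes
// (rtt in seconds, bandwidth in bytes/s); the rtt term is the bubble
double effective_throughput(u64 len, double rtt, u64 bandwidth) {
    double transfer = (double)len / (double)bandwidth;
    return (double)len / (rtt + transfer);
}

// 100 KB on a 10 MB/s 50 ms link: 100e3 / (0.050 + 0.010) ~= 1.7 MB/s
// 1 MB on the same link:          1e6   / (0.050 + 0.100) ~= 6.7 MB/s
```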
I would also like to point out that this is not specific to the network or Via.
Yes, the network has higher latency than your local device, but your local device also has far higher bandwidth.
If instead this were a 7 GB/s 50 us NVMe drive, your 100 KB read would spend 15 us transferring and 50 us doing nothing.
You'd need ~5 requests outstanding in parallel to keep the pipe full (assuming it scales perfectly, which these devices tend not to do).
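The queue depth math is the same idea flipped around. A sketch under that perfect-scaling assumption (queue_depth_needed is a made-up helper; ceil is from <math.h>):

```c
// sketch: reads that must be in flight to hide latency, assuming the
// device scales perfectly with queue depth (real devices tend not to)
u32 queue_depth_needed(double latency, u64 len, u64 bandwidth) {
    double transfer = (double)len / (double)bandwidth;
    return (u32)ceil((latency + transfer) / transfer);
}

// 100 KB on a 7 GB/s 50 us NVMe: (50 us + ~15 us) / ~15 us ~= 5 reads
```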
Going Async
In Windows, all io is natively asynchronous.
For sync io, the kernel just waits for the async request on your behalf before returning to you.
What would it look like if we did that wait ourselves?
```c
void read_async_and_wait(HANDLE file, char* buffer, u64 beg, u32 len) {
    // note: the file must have been opened with FILE_FLAG_OVERLAPPED for
    // ReadFile to be able to return before the read is done
    OVERLAPPED overlapped = {};
    overlapped.Pointer = (void*)beg;

    DWORD bytes_read;
    if (ReadFile(file, buffer, len, &bytes_read, &overlapped)) {
        // even though we did everything async, we completed synchronously
        // this can happen in Via if the data is already cached in memory
        assert(bytes_read == len);
    } else {
        // either we are running async or we failed, so check the error code
        assert(GetLastError() == ERROR_IO_PENDING);

        // wait for completion, this is what the kernel normally does for you
        assert(GetOverlappedResult(file, &overlapped, &bytes_read, 1));
    }
}
```
Simple enough, and now we have the option to not wait.
The last parameter to GetOverlappedResult() is where we specify whether we want to wait for the result or not.
For a game, this means we can trivially issue reads and process them when they are done later without any blocking.
```c
HANDLE     read_file;
OVERLAPPED overlapped;
u32        read_len;
u32        read_pending;

u32 start_read(HANDLE file, char* buffer, u64 beg, u32 len) {
    // back pressure the requestor if another read is already started
    if (read_pending) { return 0; }

    overlapped = {};
    overlapped.Pointer = (void*)beg;

    DWORD bytes_read;
    if (ReadFile(file, buffer, len, &bytes_read, &overlapped)) {
        assert(bytes_read == len);
        // even though we did everything async, we completed synchronously
        // this can happen in Via if the data is already cached in memory
        on_read_done(...);
    } else {
        // either we are running async or we failed, so check the error code
        assert(GetLastError() == ERROR_IO_PENDING);
        read_file    = file;
        read_len     = len;
        read_pending = 1;
    }
    return 1;
}

void handle_completions() {
    // nothing to do, a read isn't pending
    if (!read_pending) { return; }

    // this macro tests a field in the OVERLAPPED struct, no call
    if (!HasOverlappedIoCompleted(&overlapped)) { return; }

    DWORD bytes_read;
    assert(GetOverlappedResult(read_file, &overlapped, &bytes_read, 0));
    assert(bytes_read == read_len);
    read_pending = 0;

    // do whatever you want with the data
    on_read_done(...);
}

void main() {
    ...

    // the game loop
    for (;;) {
        handle_completions();

        // somewhere in your game you may want to start a read
        if (!start_read(...)) {
            // we'll have to try again later, a read is already in progress
        }
        ...
    }
}
```
Great, now we can start and complete reads completely asynchronously. Your thread will never block.
The CPU overhead will just be the time it takes to run ReadFile() itself.
This doesn't address the original issue just yet, though. We still only have a single read outstanding at a time.
All we've gotten so far is the ability to overlap other work with that read (which is, of course, a lot better than blocking).
Now we want to overlap reads with other reads.
```c
HANDLE     read_file;
OVERLAPPED reads_overlapped[64];
u32        reads_len[64];
u64        reads_pending;

u32 start_read(HANDLE file, char* buffer, u64 beg, u32 len) {
    u64 free = ~reads_pending;

    // back pressure the requestor if we already have our max reads pending
    if (!free) { return 0; }

    u32 i = __builtin_ctzll(free);

    // reset the slot by hand so we don't clobber the hEvent set up in main()
    OVERLAPPED* overlapped = reads_overlapped + i;
    overlapped->Internal     = 0;
    overlapped->InternalHigh = 0;
    overlapped->Pointer      = (void*)beg;

    DWORD bytes_read;
    if (ReadFile(file, buffer, len, &bytes_read, overlapped)) {
        assert(bytes_read == len);
        // even though we did everything async, we completed synchronously
        // this can happen in Via if the data is already cached in memory
        on_read_done(...);
    } else {
        // either we are running async or we failed, so check the error code
        assert(GetLastError() == ERROR_IO_PENDING);
        read_file      = file;
        reads_len[i]   = len;
        reads_pending |= 1ull << i;
    }
    return 1;
}

void handle_completions() {
    for (u32 i = 0; i < 64; i++) {
        if (!(reads_pending & (1ull << i))) continue;
        if (!HasOverlappedIoCompleted(reads_overlapped + i)) continue;

        DWORD bytes_read;
        assert(GetOverlappedResult(read_file, reads_overlapped + i, &bytes_read, 0));
        assert(bytes_read == reads_len[i]);
        reads_pending &= ~(1ull << i);

        // do whatever you want with the data
        on_read_done(...);
    }
}

void main() {
    ...

    // setup a separate event for each read slot
    for (u32 i = 0; i < 64; i++) {
        reads_overlapped[i].hEvent = CreateEvent(0, 0, 0, 0);
    }
    ...

    // the game loop
    for (;;) {
        handle_completions();

        // somewhere in your game you may want to start some reads
        if (!start_read(...)) {
            // we'll try again later, too many reads already in progress
        }
        ...
    }
}
```
Really nothing too crazy or complicated, only a few steps away from the blocking ReadFile() call we started with.
Now it is on you to "draw the rest of the owl" and generate enough read requests to keep the pipe full.
The discussion on file formats later should help with that.
One thing to point out: we are now using a separate event for each read, so why didn't we need an event in the single-read case?
When you don't specify an event, the kernel signals the file handle itself, since a file handle is itself a dispatcher object.
You can't get away with that in the multi-read case, since we need to be able to differentiate between the reads to know which are done.
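In other words, in the single-read case we could have waited on the file handle directly instead of making an event. A sketch, using the same file and overlapped as read_async_and_wait above:

```c
// sketch: with no hEvent set, the kernel signals the file object itself,
// so the handle can be waited on like any other dispatcher object
// (only unambiguous when at most one io is in flight on this handle)
WaitForSingleObject(file, INFINITE);

DWORD bytes_read;
assert(GetOverlappedResult(file, &overlapped, &bytes_read, 0));
```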
Note that I haven't talked at all about what on_read_done() may do, that is left as an exercise for the reader.
I don't know your problem or what you are doing with the data, so that is on you.
I will say, though, that you should avoid trying to generalize and separate the reading code from the system issuing the reads.
You could, for example, simply tag as much completion info alongside the read request as is necessary for handle_completions()
to dispatch to the right code and run it right there.
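For example (ReadSlot and on_done are hypothetical names, not any real API):

```c
// hypothetical sketch: keep the completion info next to the request so
// handle_completions() can dispatch directly, no generalized io layer
typedef struct ReadSlot {
    OVERLAPPED overlapped;                  // the slot's OVERLAPPED lives here
    void (*on_done)(struct ReadSlot* slot); // code to run on completion
    char* buffer;                           // where the data landed
    u32   len;                              // how much was requested
} ReadSlot;

// in handle_completions(), once slot i is done:
//     slots[i].on_done(slots + i);
```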
So where do threads come in?
Because people default to using blocking sync io (which again isn't how the kernel actually works), they conflate threads and
threading models with io. The two should have nothing to do with each other. Threads should only be relevant when you are talking
about running actual code concurrently on two different CPU cores.
To be super clear, when you kick an async read off there is no "thread" doing the read for you in the background.
What actually happens is as follows:
- The read is queued up with the underlying device (say an NVMe drive)
- In the background, the hardware is processing that request
- When done, the hardware signals the CPU
- An interrupt fires that does as little as possible then schedules a DPC
- The DPC runs right after the interrupt on the same core
- The DPC completes the request and signals the event (or the file itself)
- A user thread sees that the event (or file) is signaled and ready
If you are reaching for threads to try and overlap io, you are just asking for trouble and making the problem far more complicated
than it actually is.
What about memory mapped io?
Programmers seem to be in love with memory mapped files.
I would argue they are a total gimmick and shouldn't be used for any serious data problems.
I can imagine ways the kernel could implement them better, but as they are in all modern systems they should be avoided.
As before, here is a timeline for a memory mapped read:
- You call MapViewOfFile()
- Some downstream code accesses part of the mapping for the first time
- The CPU page faults
- The kernel's page fault handler runs
- The kernel dispatches to the Via driver
- Via sends a request to the server
- Your thread is put to sleep
- The request reaches the server after 25 ms
- The server starts sending packets back at 10 MB/s
- The first packet is received after 25 ms
- The last packet is received after another 10 ms
- Via completes the request and wakes your thread
- Your thread restarts at the page fault point and continues
You may notice that steps 5-13 are identical to steps 2-10 of the synchronous io example earlier.
That is because page faults are synchronous!
Also note that the standard page size is 4k, but the kernel developers know how bad this is for performance,
so they batch up 16k, 32k, or sometimes 64k worth of pages around the actual fault when sending the request to Via.
Besides the blocking io on small reads, mmap io also has the problem of being very sloppy.
Programmers aren't explicit about what is and isn't loaded and when, and even just a 4 byte load off a struct may pull 64k.
In the most egregious case, a game was pulling 10+ gigs of data during a load, and it took almost an hour on a 10 ms connection.
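To make the sloppiness concrete (Header, version, and mapping are hypothetical):

```c
// hypothetical sketch: a 4 byte load that costs a full blocking fault
Header* h = (Header*)MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
u32 version = h->version; // first touch page faults and blocks for an RTT,
                          // pulling in up to 64k whether you wanted it or not
```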
Unfortunately we can't completely dodge memory mapped io since executables themselves are loaded with it.
This can be problematic if your exe is huge (10+ megs) and has dozens of DLLs.
The kernel loader isn't very smart about overlapping these things.
Shouldn't we be using DirectStorage?
Sure! But this is kind of orthogonal to the discussion here.
Everything said here is about saturating the underlying device.
DirectStorage (and the win10 ring thing) is about reducing the CPU overhead of ReadFile() by using single-producer
single-consumer queues to batch up requests and minimize transitions into the kernel.
This is just like io_uring on Linux, though far more limited (really only reads).
DirectStorage also has GPU-side texture decompression features (on Xbox only, I think), which are not relevant to us here.
Prefetching
todo...
How do I design my file formats?
todo: keeping trees shallow, dependency chains, file sizes, file patches, etc
What about CreateFile()?
todo: cost of opening files, directory sizes, info queries, etc
What about WriteFile()?
todo...
The Future
todo: io priorities, on the fly content updates, shared engine depots, etc