viona: multiqueue device should stay multiqueue across migration #1121
Conversation
We correctly export and import the propolis-side state for a multiqueue VirtIO device, but we did not communicate that through to viona. On import, the virtio-nic device has only told viona it will use one queue pair; we've skipped the "normal" set_features() in favor of setting features on the handle directly. Setting an imported multi-queue device to running at this point will reset the many queues, but viona's rings are not in a resettable state: the resets fail, and the device is set to NEEDS_RESET immediately. Communicating the correct number of queue pairs to viona is a clear improvement, but we're not quite out of bugs yet..
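To make the failure mode concrete, here's a minimal sketch of the import path described above. `VionaHandle` and its methods are hypothetical stand-ins, not the actual propolis/viona API:

```rust
// Hypothetical sketch: VionaHandle and its methods are illustrative
// stand-ins for the propolis/viona interfaces, not the real API.
struct VionaHandle {
    queue_pairs: u16, // viona's idea of how many ring pairs to manage
}

impl VionaHandle {
    fn import(&mut self, features: u64, pairs: u16) {
        // Import writes the negotiated features straight to the handle,
        // skipping the "normal" set_features() path that would have
        // told viona about multiple queue pairs...
        self.set_features_raw(features);
        // ...so the pair count has to be forwarded explicitly. Without
        // this, viona still assumes a single pair; setting the device
        // to running then resets rings that are not in a resettable
        // state, the resets fail, and the device lands in NEEDS_RESET.
        self.set_queue_pairs(pairs);
    }

    fn set_features_raw(&mut self, _features: u64) {}

    fn set_queue_pairs(&mut self, pairs: u16) {
        self.queue_pairs = pairs;
    }
}
```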
between 70662f7 and 0b0566f I noticed an exciting issue very much like #1045: rebooting a guest would cause the PCI device to disappear. since this would have been, if everything else was in a happy state, a bug introduced by #1047...
```rust
delete_vnic(&vnic_name);
create_vnic(&underlying_nic, &vnic_name);
```
this is incidental but seems worth keeping: I was trying to chase out any sources of anything potentially sticking around across the "migrated" VMs, and the test vnic itself "could" stick around but really shouldn't.
the dladm commands won't (shouldn't?) block or be blocked by test operations, so I don't super love blocking the runtime like this but I'm also not worried about it..
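For context, a sketch of what that setup amounts to, assuming the helpers shell out to dladm (the function shape and names here are illustrative, not the PR's actual test code):

```rust
use std::process::Command;

// Illustrative sketch: tear down any stale test vnic and recreate it
// over the underlying link, so nothing lingers across the "migrated"
// VMs. Names are hypothetical, not the PR's actual helpers.
fn recreate_test_vnic(underlying_nic: &str, vnic_name: &str) {
    // Ignore failure here: the vnic may legitimately not exist yet.
    let _ = Command::new("dladm")
        .args(["delete-vnic", vnic_name])
        .status();

    let status = Command::new("dladm")
        .args(["create-vnic", "-l", underlying_nic, vnic_name])
        .status()
        .expect("failed to run dladm");
    assert!(status.success(), "could not create vnic {vnic_name}");
}
```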
lgfa29 left a comment
Giving a mostly ceremonial ✅ since I'm not too familiar with this part of the code, but the general logic makes sense to me.
```rust
let Some(pairs) = (queues - 1).checked_div(2) else {
    return Err(MigrateStateError::ImportFailed(format!(
        "source queue count was not a number of pairs + 1: {queues}"
    )));
};
```
I may be misreading this, but is the goal to check for values like 2 * p + 1, where p is the number of queue pairs? If that's the case, we may need to verify checked_rem is Some(0) as well?
that's a fair question, I'm not sure what would happen if a guest tried to set the device to use 0 tx/rx pairs.. obviously it wouldn't do useful networking, but that's maybe a legal configuration? and if so we should maintain it across migration. whatever happens, it would be good to have a test for!
I'll check out what the spec says, if anything. if viona tolerates this, then we should faithfully export/import it (and this should have a note), but if it's not allowed then.. yeah we should be more strict about it.
Yeah, 0 could be a problem, but reading the code I was thinking more of cases where there aren't X pairs + 1? For example, queues = 4 (or any other even number) doesn't sound like a valid value based on the error message, but the code is only checking for overflow errors.
But I don't know how this part works, so I'm mostly going by the code and error messages. Feel free to ignore it if this doesn't make sense.
oh. uh. yeah. I was expecting checked_div to return None when the numerator does not evenly divide by 2, but it just truncates. definitely needs a % or checked_rem() or something for that.
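For reference, a small runnable sketch of that point: on unsigned integers, checked_div only returns None when the divisor is zero, otherwise it truncates, so an explicit remainder check is what actually rejects even counts. The `queue_pairs` helper below is illustrative, not the PR's code:

```rust
// Illustrative only: `queue_pairs` is a hypothetical helper showing the
// stricter validation discussed above, not the PR's actual code.
fn queue_pairs(queues: u16) -> Result<u16, String> {
    // Reject zero before subtracting the "+ 1" queue.
    let Some(rest) = queues.checked_sub(1) else {
        return Err(format!("queue count cannot be zero: {queues}"));
    };
    // checked_div truncates for any nonzero divisor, so the remainder
    // check is what rejects counts that are not 2 * pairs + 1.
    if rest % 2 != 0 {
        return Err(format!(
            "source queue count was not a number of pairs + 1: {queues}"
        ));
    }
    Ok(rest / 2)
}

fn main() {
    // The truncation called out above: no error for an even count.
    assert_eq!((4u16 - 1).checked_div(2), Some(1));

    assert_eq!(queue_pairs(3), Ok(1)); // one tx/rx pair, plus one
    assert!(queue_pairs(4).is_err()); // even counts now rejected
    assert_eq!(queue_pairs(1), Ok(0)); // the "0 pairs" case discussed above
}
```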