
viona: multiqueue device should stay multiqueue across migration #1121

Open
iximeow wants to merge 6 commits into master from ixi/viona-import-usepairs

Conversation

iximeow (Member) commented Apr 21, 2026

We correctly export and import the propolis-side state given a multiqueue VirtIO device, but we did not communicate that through to viona. On import the virtio-nic device has only told viona it will use one queue pair; we've skipped the "normal" set_features() in favor of setting features on the handle directly. Setting an imported multi-queue device running at this point will reset the many queues, but viona rings are not in a resettable state: they fail to reset and set the device to NEEDS_RESET immediately.

Communicating the correct number of queue pairs to viona is a clear improvement, but we're not quite out of bugs yet.

iximeow added labels on Apr 21, 2026: bug (Something that isn't working.), networking (Related to networking devices/backends.), migration (Issues related to live migration.)
iximeow (Member, Author) commented Apr 23, 2026

between 70662f7 and 0b0566f I noticed an exciting issue very much like #1045: rebooting a guest would cause the PCI device to disappear. since peak wasn't retained across VirtQueues export/import, we'd assume any previously-initialized queues beyond len were never touched and not reset them. the guest then sees them as they were at import - enabled - after a "device reset" and rightfully refuses to operate the accursed device.

this would have been, if everything else was in a happy state, a bug introduced by #1047...
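The `peak`/`len` reset bug described above can be sketched as follows. Note that `Queue`, `VirtQueues`, `len`, and `peak` here are hypothetical stand-ins for the propolis types, not the real definitions: `len` is the count of queues the guest is currently using, `peak` a high-water mark of queues ever initialized, and a device reset must walk up to `peak`, not `len`, or queues the guest previously enabled survive the reset.

```rust
struct Queue {
    enabled: bool,
}

// Hypothetical stand-in for the real VirtQueues type discussed above.
struct VirtQueues {
    queues: Vec<Queue>,
    len: usize,  // queues the guest is currently using
    peak: usize, // high-water mark of queues ever initialized
}

impl VirtQueues {
    fn reset(&mut self) {
        // Resetting only `..self.len` would leave queues in `len..peak`
        // enabled; if `peak` is not retained across export/import, a
        // post-import device reset misses those queues and the guest sees
        // them still enabled after a "device reset".
        for q in self.queues[..self.peak].iter_mut() {
            q.enabled = false;
        }
    }
}

fn main() {
    // Guest enabled 4 queues at some point, then dropped back to using 2.
    let mut vqs = VirtQueues {
        queues: (0..4).map(|_| Queue { enabled: true }).collect(),
        len: 2,
        peak: 4,
    };
    vqs.reset();
    assert!(vqs.queues.iter().all(|q| !q.enabled));
}
```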

Comment on lines +1555 to +1556
delete_vnic(&vnic_name);
create_vnic(&underlying_nic, &vnic_name);
iximeow (Member, Author) commented:

this is incidental but seems worth keeping: I was trying to chase out any sources of anything potentially sticking around across the "migrated" VMs, and the test vnic itself "could" stick around but really shouldn't.

the dladm commands won't (shouldn't?) block or be blocked by test operations, so I don't love blocking the runtime like this, but I'm also not worried about it.

lgfa29 (Member) left a comment


Giving a mostly ceremonial ✅ since I'm not too familiar with this part of the code, but the general logic makes sense to me.

Comment on lines +945 to +949
let Some(pairs) = (queues - 1).checked_div(2) else {
return Err(MigrateStateError::ImportFailed(format!(
"source queue count was not a number of pairs + 1: {queues}"
)));
};
lgfa29 (Member) commented:

I may be misreading this, but is the goal to check for values like 2 * p + 1, where p is the number of queue pairs? If that's the case, we may need to verify that checked_rem is Some(0) as well?

iximeow (Member, Author) replied:

that's a fair question, I'm not sure what would happen if a guest tried to set the device to use 0 tx/rx pairs.. obviously it wouldn't do useful networking, but that's maybe a legal configuration? and if so we should maintain it across migration. whatever happens, it would be good to have a test for!

I'll check out what the spec says, if anything. if viona tolerates this, then we should faithfully export/import it (and this should have a note), but if it's not allowed then.. yeah we should be more strict about it.

lgfa29 (Member) commented May 1, 2026

Yeah, 0 could be a problem, but reading the code I was thinking more of cases where there aren't X pairs + 1? For example, queues = 4 (or any other even number) doesn't sound like a valid value based on the error message, but the code is only checking for overflow errors.

But I don't know how this part works, so I'm mostly going by the code and error messages. Feel free to ignore it if this doesn't make sense.

iximeow (Member, Author) commented May 1, 2026

oh. uh. yeah. I was expecting checked_div to return None when the numerator doesn't divide evenly by 2, but it just truncates. definitely needs a % or checked_rem() or something for that.
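A minimal standalone sketch of the stricter check agreed on above, assuming the invariant is `queues == 2 * pairs + 1` (the extra queue being the control queue). `pairs_from_queue_count` is a hypothetical helper, not the actual propolis code: `checked_div(2)` on an unsigned value can never return None (it truncates, so e.g. 4 queues would silently yield 1 pair), hence the explicit parity check.

```rust
// Hypothetical helper sketching the validation discussed in this thread.
// Assumes a valid source queue count satisfies `queues == 2 * pairs + 1`.
fn pairs_from_queue_count(queues: u16) -> Result<u16, String> {
    // checked_sub guards against queues == 0.
    let Some(n) = queues.checked_sub(1) else {
        return Err("source queue count was zero".to_string());
    };
    // checked_div(2) would just truncate here, so check parity explicitly:
    // an even total queue count cannot be `pairs * 2 + 1`.
    if n % 2 != 0 {
        return Err(format!(
            "source queue count was not a number of pairs + 1: {queues}"
        ));
    }
    Ok(n / 2)
}

fn main() {
    assert_eq!(pairs_from_queue_count(5), Ok(2)); // 2 rx/tx pairs + control
    assert!(pairs_from_queue_count(4).is_err()); // even count: not 2p + 1
    assert!(pairs_from_queue_count(0).is_err());
    // queues == 1 yields 0 pairs, which passes this check; whether that is
    // a legal configuration is the open question in the thread above.
    assert_eq!(pairs_from_queue_count(1), Ok(0));
}
```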
