OCTOI Bit errors

Hey,

one long-standing issue with OCTOI is random bit errors.
Somewhere along the (rather complex) chain of osmo-e1d, icE1usb, Digium TE820, yate, DAHDI and maybe the Auerswald PBXes, bit errors get introduced.

This was first noticed when using OCTOI for modem calls, which would randomly get interrupted.
At CCCamp23, OCTOI was used for interconnecting V5 access multiplexers along the field.
The connection there also wasn’t 100% reliable (despite the local and very high quality network).

One issue was identified by @laforge at Camp already:

OCTOI “compresses” the frames by not transmitting timeslots which haven’t changed since the last frame. This saves a lot of bandwidth (E1 is 2 MBit/s after all, which is still a lot even for modern connections), which would otherwise be wasted in idle/flag octets.

This masking/compression mechanism can now be disabled by using the force-send-all-ts option. This has improved reliability at Camp dramatically (but not completely).
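To make the masking idea a bit more concrete, here is a minimal sketch of how such a scheme could work. This is not the actual OCTOI wire format or osmo-e1d code: the function names, the 32-bit mask layout and the `force_send_all_ts` parameter (named after the option above) are assumptions made purely for illustration.

```python
from typing import Tuple

TS_PER_FRAME = 32  # an E1 frame carries 32 timeslots of one octet each


def mask_frame(prev: bytes, cur: bytes, force_send_all_ts: bool = False) -> Tuple[int, bytes]:
    """Return (bitmask, payload); bit i of the mask is set if timeslot i is sent."""
    mask = 0
    payload = bytearray()
    for ts in range(TS_PER_FRAME):
        # Only octets that differ from the previous frame are transmitted,
        # unless masking is disabled (hypothetical stand-in for force-send-all-ts).
        if force_send_all_ts or cur[ts] != prev[ts]:
            mask |= 1 << ts
            payload.append(cur[ts])
    return mask, bytes(payload)


def unmask_frame(prev: bytes, mask: int, payload: bytes) -> bytes:
    """Rebuild a full frame; timeslots missing from the payload are copied from prev."""
    out = bytearray(prev)
    octets = iter(payload)
    for ts in range(TS_PER_FRAME):
        if mask & (1 << ts):
            out[ts] = next(octets)
    return bytes(out)
```

The important property is that the receiver can only rebuild a frame correctly if its `prev` is the same frame the sender masked against. With masking disabled every frame is self-contained, at the cost of always sending all 32 octets.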

Late in the OCTOI development, a RIFO (random-in, first-out) buffer was introduced to re-order incoming frames from the network into linear order again. We noticed that DOCSIS (cable internet) lines in particular often receive packets out-of-order.
Apparently the RIFO mechanism was introduced after the frame unmasking. This leads to frames being unmasked against random, other frames which are in the buffer (instead of the last frame by frame number).

See: Bug #6169: Frame masking against network frame ordering, not frame numbers - osmo-e1d - Open Source Mobile Communications
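A small toy example of why the ordering matters. It uses 4-timeslot frames and a made-up packet tuple of (frame number, timeslot bitmask, changed octets); none of this is the real osmo-e1d data structure or my patch, just a sketch of the failure mode.

```python
from typing import Dict, Iterable, Tuple

Packet = Tuple[int, int, bytes]  # (frame number, timeslot bitmask, changed octets)


def unmask(prev: bytes, mask: int, payload: bytes) -> bytes:
    """Rebuild a frame from prev plus the octets flagged in mask."""
    out = bytearray(prev)
    octets = iter(payload)
    for ts in range(len(prev)):
        if mask & (1 << ts):
            out[ts] = next(octets)
    return bytes(out)


def decode(packets: Iterable[Packet], start: bytes, by_frame_number: bool) -> Dict[int, bytes]:
    """Unmask packets either in frame-number order or in plain arrival order."""
    pkts = list(packets)
    if by_frame_number:
        pkts.sort(key=lambda p: p[0])
    frames = {}
    prev = start
    for fn, mask, payload in pkts:
        prev = unmask(prev, mask, payload)
        frames[fn] = prev
    return frames


start = bytes(4)
# Frame 1 sets TS0 to 0xAA, frame 2 sets TS1 to 0xBB, frame 3 sets TS0 back to 0x00.
# The network delivers them as 1, 3, 2.
arrival = [(1, 0b0001, b"\xaa"), (3, 0b0001, b"\x00"), (2, 0b0010, b"\xbb")]

good = decode(arrival, start, by_frame_number=True)
bad = decode(arrival, start, by_frame_number=False)
# good[2] is aa bb 00 00, but bad[2] is 00 bb 00 00:
# frame 2 was unmasked against frame 3 instead of frame 1.
```

Sorting the whole list is obviously a simplification. A real receiver would only re-order within a bounded window and would need some policy for frames that never arrive.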

A quick & dirty patch written by me seems to improve the situation considerably, but there are still bit errors occurring.

There are 3 ways of doing BERT checks:

  • BERT hotline in DIVF yate
  • BERT hotline in local yate
  • 2-channel BERT (tester calling itself on the second B channel)

Both the DIVF yate and my personal yate installation have a special number, which just plays a carefully crafted .wav file. This wave file contains a very long (4 hour+) PRBS11 sequence.
An ISDN tester (like the Argus) can then call this number and will receive that PRBS11 sequence and compare it to a locally generated version.
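For reference, a minimal sketch of what "compare it to a locally generated version" boils down to. PRBS11 is the sequence generated by the polynomial x^11 + x^9 + 1 and repeats every 2047 bits. The function names are my own, and this ignores everything a real tester has to deal with (synchronising to the incoming pattern, bit order within the octet, optional inversion), so it is not how the Argus or the .wav file actually do it.

```python
from typing import Iterable, Iterator


def prbs11(seed: int = 0x7FF) -> Iterator[int]:
    """Yield PRBS11 bits (polynomial x^11 + x^9 + 1, period 2047)."""
    state = seed & 0x7FF
    if state == 0:
        raise ValueError("seed must be non-zero, an all-zero LFSR never leaves that state")
    while True:
        # Feedback is the XOR of register stages 11 and 9.
        new_bit = ((state >> 10) ^ (state >> 8)) & 1
        state = ((state << 1) | new_bit) & 0x7FF
        yield new_bit


def count_bit_errors(received: Iterable[int], reference: Iterable[int]) -> int:
    """Count positions where the received bits differ from the reference bits."""
    return sum(r != e for r, e in zip(received, reference))
```

`count_bit_errors(rx_bits, prbs11())` then gives the error count for an already aligned, finite capture `rx_bits`.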

The ISDN tester can also call itself on the other B-channel and then generate its own test sequence and listen to it again. This is ideal for testing both directions of a link and will also avoid any problems in the DIY BERT .wav mechanism.

3 tests can be done this way:

  • calling itself locally on the S0 bus (through the Auerswald only) - 2161 - works fine, no bit errors
  • calling itself through the yate (via TE820, DAHDI, yate) - 02161 - already results in bit errors!
  • calling itself through OCTOI (via TE820, DAHDI, my yate, DAHDI again, trunkdev, trunkdev, DAHDI again, DIVF yate and all the way back) - 003065022161 - results in even more bit errors.

So even without any osmo-e1d/OCTOI involvement, there are already issues occurring.
I will replace the TE820 with an icE1usb temporarily to try and investigate if this behaves better.
After that, I might try to replace yate with asterisk (or similar) to try and check if yate or DAHDI is at fault.

Hm. This escalated a bit.
Turns out, QEMU/KVM (at least in the libvirt configuration I was using) can’t pass the icE1usb USB device into a VM (without breaking it).

New test setup:
Raspberry Pi 4, icE1usb, Auerswald, DAHDI

No osmo-e1d, no Digium TE820, no KVM/virtualization.

“Good” news: The problem persists and occurs there as well.
I’ve configured both Asterisk and Yate (retronetworking fork) to just loop incoming calls back to the Auerswald.
I’m then (again) calling the Argus ISDN tester from itself, doing a loopback BERT test.

Asterisk curiously shows a behaviour that I’ve already noticed earlier, in that it will generate a ton of errors in the first ~15 seconds of a fresh connection (which is probably also why T-Views and Asterisk don’t play nicely together). After that, I could finish a 30min BERT without any further errors.

Yate on the other hand shows the familiar “some errors, some time, at random intervals” picture, which we already know from the DIVF/my personal yate setup.

It seems yate needs some further investigation, then…

from yate/modules/server/zapcard.cpp:

	// Set buffers
	struct dahdi_bufferinfo bi;
	bi.txbufpolicy = DAHDI_POLICY_IMMEDIATE;
	bi.rxbufpolicy = DAHDI_POLICY_IMMEDIATE;

Hm. Citing the Digium docs for this:

Immediate Buffer Policy
This policy means that as soon as data arrives in the buffer within DAHDI, it is immediately pushed out even if the kernel is doing something else. If the kernel is busy with something else, the data is lost and never gets to the Digium card.

Pros: Low latency
Cons: Data loss

That (and the settings for the number of buffers and buffer sizes) might not be ideal…

Interesting find. At least in libosmo-abis (used by osmo-v5, osmo-bsc, osmo-mgw) we are doing this right (POLICY_FULL). Also, e1-prbs-test and osmo-isdntap are using POLICY_FULL. So it really only seems to be yate which is not.

This is looking pretty promising, indeed:

Full 30min BERT without any issues (so far). More investigation (latency?), etc. needed.
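On the latency question, a rough back-of-the-envelope sketch: a B channel carries one 8-bit sample every 125 µs, so every buffered byte costs 125 µs of delay. The 128-byte block size is the SetBlkSize value from the yate ioctl log in this thread; the buffer count of 4 is a purely hypothetical example, and how many buffers DAHDI actually fills before handing data on depends on the policy, so treat this as an upper-bound estimate only.

```python
SAMPLE_RATE_HZ = 8000  # 64 kbit/s B channel: 8000 octets per second


def buffering_delay_ms(block_size: int, num_buffers: int) -> float:
    """Delay added if num_buffers blocks of block_size bytes must fill up first."""
    return 1000.0 * block_size * num_buffers / SAMPLE_RATE_HZ


one_block = buffering_delay_ms(128, 1)    # 16.0 ms
four_blocks = buffering_delay_ms(128, 4)  # 64.0 ms (hypothetical buffer count)
```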

Yeah, that seems like it might be a pretty bad idea (for data use, at least).
Asterisk docs are pretty clear that you want to use the other modes for fax calling, so I’m guessing that applies to everything we’re doing as well :laughing:

Another thing that might require some investigation is the Digium TE820 (8-port E1 PCIe card) driver configuration.

The wct4xxp kernel module has several parameters that look like they might be relevant:

parm:           max_latency:int = 127
parm:           latency:int = 1
parm:           ms_per_irq:int = 1

As we’re both (in DIVF and my personal setup) using a TE820, timing problems there (missed interrupts? etc.) would probably negatively affect the OCTOI/trunkdev setup as well.

IOCTL(SetChannel) running on channel 1 (param=1). 0: Success [0x55b0458e70]
IOCTL(GetParams) running on channel 1 (param=0). 0: Success [0x55b0458e70]
IOCTL(SetBlkSize) running on channel 1 (param=128). 0: Success [0x55b0458e70]
IOCTL(SetBuffers) running on channel 1 (param=1). 0: Success [0x55b0458e70]
IOCTL(FlushBuffers) running on channel 1 (param=7). 0: Success [0x55b0458e70]
IOCTL(FlushBuffers) running on channel 1 (param=7). 0: Success [0x55b0458e70]
IOCTL(SetFormat) running on channel 1 (param=2). 0: Success [0x55b0458e70]
IOCTL(SetToneDetect) running on channel 1 (param=3). 22: Invalid argument [0x55b0458e70]
IOCTL(SetToneDetect) failed on channel 1 (param=3). 25: Inappropriate ioctl for device [0x55b0458e70]

These are the DAHDI ioctls which yate/zapcard runs on the B channels.
Interestingly, the DAHDI_AUDIOMODE flag isn’t getting set anywhere.
That tells DAHDI to treat a channel as purely digital data and disable any kind of funny business (echo cancellation, conferencing hacks and even software volume control!).

I have updated my pull request with a change that’ll always set the parameter on B channels:

The Yate buffer-handling changes were merged, so I deployed them on the OCTOI hub.
First test results are very promising:

A 30min BERT test (loopback, so both directions used!) completes with only 3140 bits wrong and each of these was caused by reordering (with an accompanying log entry, so all of this should be fixed as soon as we have a better osmo-e1d patch).
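To put the 3140 wrong bits into perspective, a quick calculation of the resulting bit error rate. This assumes the tester counts errors on a single 64 kbit/s B channel for the full 30 minutes, which is my assumption and not something the tester output in this thread confirms.

```python
B_CHANNEL_BPS = 64_000  # one ISDN B channel


def bit_error_rate(errors: int, duration_s: float, bitrate_bps: int = B_CHANNEL_BPS) -> float:
    """Errors divided by the total number of bits transferred in duration_s."""
    total_bits = duration_s * bitrate_bps
    return errors / total_bits


total_bits = 30 * 60 * B_CHANNEL_BPS    # 115,200,000 bits in 30 minutes
ber = bit_error_rate(3140, 30 * 60)     # roughly 2.7e-5
```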

Using the AC’97 Winmodem in my ThinkPad T41 I’m now getting a stable 54.6 kBit/s, which is the fastest I’ve ever seen over OCTOI. That’s a good sign I guess!
The handshake still sounds super weird, lots of strange pinging sounds which sound like re-negotiations.

Not entirely sure if that’s something between the Winmodem and the PM3 or if it’s related to the link itself.

For the first time the modem connection now also seems to be very stable – I’ve been online (through the PM3’s PPP feature) for the past 15 min without any hiccups.
Maybe it’s time for some Napster soon :stuck_out_tongue:

[Screenshot: 2023-09-07 18_03_30-Bibliothek]
ISDN PPP downloads now run at pretty much line rate as well :grin:
… I think this is actually reporting slightly too much bandwidth, as that would be almost 139 kbit/s (and there’s no way that compression is happening, as this is an HTTPS connection)

This is great stuff. Thanks for sharing your progress here so we can follow along.

If I remember correctly, at least some of those wct4xxp parameters are auto-tuned by the driver at runtime based on missed interrupts.

I can send you another PM3 unit if you want to do some local testing to avoid any potential influence of OCTOI. It’s a fairly small and lightweight 2U enclosure, so not too bulky…

Ah, see, I learned something new about V.90/V.92 modems today:

That strange noise I was hearing in the training sequence is called “DIL” aka Digital Impairment Learning and it’s very much dependent on the modem vendor.
In my case now, it was Agere Softmodem (later known as LSI) (with that rather unusual pinging sound at the end of the re-negotiation)…

I was expecting something more like Ambient Technologies (later Intel) Modems, which is what my modem back in the day sounded like :smiley_cat:

Also: I’ve managed to update my winmodem drivers from 2.1.13 (which IBM shipped for this laptop in 2003) to the latest 2.2.89 version from 2008.
I had to manually edit the .inf driver file to allow it to be installed on the ICH4 AC’97 chipset of my T41.
This has also led to the loss of audio feedback (meaning I can no longer hear the handshake through the speakers, regardless of ATL/ATM settings).
The handshake is now much quicker and the latency has plummeted. With the old driver, I was seeing 1000-1300ms ping times, now I’m down to 380ms (through OCTOI, L2TP and everything).
That’s an excellent real world improvement in usability.

If someone actually wants to listen to the handshake, I’ve put a recording on https://tbspace.de/content/downloads/isdntap-manawyrm-agere.wav
(with stereo separation no less :grin:)


When looking at the audio levels, I do wonder if there’s something that could be optimized to make better use of the available dynamic range… Then again, 54666 is probably already at the very far edge of what’s possible (and I wonder if 56000 is even achievable in this combination).
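For context on how close 54666 is to the top: as far as I know, V.90 defines its downstream rates from 28000 to 56000 bit/s in steps of 8000/6 bit/s, so the possible rates can simply be enumerated. The little helper below is my own sketch, not taken from any modem driver.

```python
from fractions import Fraction

STEP = Fraction(8000, 6)  # spacing of V.90 downstream rates in bit/s
MIN_RATE, MAX_RATE = 28_000, 56_000


def v90_downstream_rates() -> list:
    """List all V.90 downstream rates from MIN_RATE to MAX_RATE as exact fractions."""
    n_steps = int((MAX_RATE - MIN_RATE) / STEP)
    return [MIN_RATE + i * STEP for i in range(n_steps + 1)]


rates = v90_downstream_rates()
top_two = [int(r) for r in rates[-2:]]  # [54666, 56000]
```

So 54666 is the second-highest rate on that ladder, exactly one step below 56000.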
