[RFC] unbound: Increase timeout value for unknown dns-server

Message ID 20210106101742.6561-1-jonatan.schlag@ipfire.org
State Rejected
Series [RFC] unbound: Increase timeout value for unknown dns-server

Commit Message

Jonatan Schlag Jan. 6, 2021, 10:17 a.m. UTC
When unbound has no information about a DNS server,
a timeout of 376 msec is assumed. This works well in a lot of situations,
but they mention in their documentation that this could be way too low.
They recommend a timeout of 1126 msec for satellite connections
(https://nlnetlabs.nl/documentation/unbound/unbound.conf).
Setting this value to 1126 msec should make the first queries to an
unknown server more useful:
they do not time out, so these queries do not need to be sent again.

On a stable link, this behaviour should not have negative implications.
As the first query results arrive, the timeout value gets updated,
and the high value of 1126 msec is replaced by something useful.

Signed-off-by: Jonatan Schlag <jonatan.schlag@ipfire.org>
---
 config/unbound/unbound.conf | 1 +
 1 file changed, 1 insertion(+)
  

Comments

Paul Simmons Jan. 6, 2021, 12:02 p.m. UTC | #1
On 1/6/21 4:17 AM, Jonatan Schlag wrote:
> When unbound has no information about a DNS-server
> a timeout of 376 msec is assumed. This works well in a lot of situations,
> but they mention in their documentation that this could be way too low.
> They recommend a timeout of 1126 msec for satellite connections
> (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
> Settings this value to 1126 msec should make the first queries to an
> unknown server, more useful.
> They do not timeout and so these queries do not need to be sent again.
>
> On a stable link, this behaviour should not have negative implications.
> As the first result of queries arrive the timeout value gets updated,
> and the high value of 1126 msec gets set to something useful.
>
> Signed-off-by: Jonatan Schlag <jonatan.schlag@ipfire.org>
> ---
>   config/unbound/unbound.conf | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/config/unbound/unbound.conf b/config/unbound/unbound.conf
> index f78aaae8c..02f093015 100644
> --- a/config/unbound/unbound.conf
> +++ b/config/unbound/unbound.conf
> @@ -62,6 +62,7 @@ server:
>   
>   	# Timeout behaviour
>   	infra-keep-probing: yes
> +	unknown-server-time-limit: 1128
>   
>   	# Bootstrap root servers
>   	root-hints: "/etc/unbound/root.hints"

This sounds promising to me, as I have many DNS lookup timeouts (ISP is 
HughesNot, er, HughesNet).

+1

Paul
  
Michael Tremer Jan. 6, 2021, 3:14 p.m. UTC | #2
Hello,

> On 6 Jan 2021, at 12:02, Paul Simmons <mbatranch@gmail.com> wrote:
> 
> On 1/6/21 4:17 AM, Jonatan Schlag wrote:
>> When unbound has no information about a DNS-server
>> a timeout of 376 msec is assumed. This works well in a lot of situations,
>> but they mention in their documentation that this could be way too low.
>> They recommend a timeout of 1126 msec for satellite connections
>> (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
>> Settings this value to 1126 msec should make the first queries to an
>> unknown server, more useful.
>> They do not timeout and so these queries do not need to be sent again.
>> 
>> On a stable link, this behaviour should not have negative implications.
>> As the first result of queries arrive the timeout value gets updated,
>> and the high value of 1126 msec gets set to something useful.
>> 
>> Signed-off-by: Jonatan Schlag <jonatan.schlag@ipfire.org>
>> ---
>>  config/unbound/unbound.conf | 1 +
>>  1 file changed, 1 insertion(+)
>> 
>> diff --git a/config/unbound/unbound.conf b/config/unbound/unbound.conf
>> index f78aaae8c..02f093015 100644
>> --- a/config/unbound/unbound.conf
>> +++ b/config/unbound/unbound.conf
>> @@ -62,6 +62,7 @@ server:
>>    	# Timeout behaviour
>>  	infra-keep-probing: yes
>> +	unknown-server-time-limit: 1128
>>    	# Bootstrap root servers
>>  	root-hints: "/etc/unbound/root.hints"

I am not entirely sure what this is supposed to fix.

It is possible that a DNS response takes longer than 376ms, indeed. Does it harm us if we send another packet? No.

So what is this changing in real life?

> This sounds promising to me, as I have many DNS lookup timeouts (ISP is HughesNot, er, HughesNet).

@Paul: I am not sure that increasing timeouts is the solution. From my point of view, you should change the name servers.

> 
> +1
> 
> Paul
  
Tapani Tarvainen Jan. 6, 2021, 4:19 p.m. UTC | #3
On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:

> > On 6 Jan 2021, at 12:02, Paul Simmons <mbatranch@gmail.com> wrote:
> > 
> > On 1/6/21 4:17 AM, Jonatan Schlag wrote:
> >> When unbound has no information about a DNS-server
> >> a timeout of 376 msec is assumed. This works well in a lot of situations,
> >> but they mention in their documentation that this could be way too low.
> >> They recommend a timeout of 1126 msec for satellite connections
> >> (https://nlnetlabs.nl/documentation/unbound/unbound.conf).

A small nit, they actually suggest 1128 ... and that's indeed what
the patch has:

> >> +	unknown-server-time-limit: 1128

But that's trivial. The point:

> I am not entirely sure what this is supposed to fix.

> It is possible that a DNS response takes longer than 376ms, indeed.
> Does it harm us if we send another packet? No.

If you are behind a slow satellite link, it can take more than that
*every time*. So you would always have sent another query before
getting a response to the previous one.

With TCP that would mean never getting a response, because you'd
always terminate the connection too soon. With UDP, I'm not sure,
depends on how unbound handles incoming responses to queries it's
already deemed lost and sent again. Adjusting delay-close might help.
But it may be it would not work at all when the limit is too small.

That would mean that someone installing IPFire in some remote location
with a slow link would conclude that it just doesn't work.

The downside of increasing the limit is that sometimes replies will
take longer when a packet is lost on the way because we'd wait longer
before re-sending. So it should not be increased too much either.
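
To put rough numbers on that trade-off, here is a minimal sketch. It
assumes the first packet is lost, that the resend goes out exactly when
the timeout expires, and that the resend is answered after one round
trip. The function name and the 20 ms example RTT are made up for
illustration only.

```python
def reply_time_after_loss_ms(rtt_ms: float, timeout_ms: float) -> float:
    """Time until an answer arrives when the first packet is lost.

    Simplifying assumption: the resend is sent exactly at timeout expiry
    and is answered after one round trip.
    """
    return timeout_ms + rtt_ms


# Hypothetical fast link with a 20 ms round trip:
print(reply_time_after_loss_ms(20, 376))   # 396
print(reply_time_after_loss_ms(20, 1128))  # 1148
```

Under these assumptions the higher limit costs roughly 750 ms extra, but
only for queries whose first packet is lost while the server is still
unknown.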

I don't have data to judge what the limit should be, but I'd tend to
trust the NLnet Labs recommendation here and go with the suggested 1128 ms.
  
Michael Tremer Jan. 6, 2021, 6:01 p.m. UTC | #4
Hello,

> On 6 Jan 2021, at 16:19, Tapani Tarvainen <ipfire@tapanitarvainen.fi> wrote:
> 
> On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
> 
>>> On 6 Jan 2021, at 12:02, Paul Simmons <mbatranch@gmail.com> wrote:
>>> 
>>> On 1/6/21 4:17 AM, Jonatan Schlag wrote:
>>>> When unbound has no information about a DNS-server
>>>> a timeout of 376 msec is assumed. This works well in a lot of situations,
>>>> but they mention in their documentation that this could be way too low.
>>>> They recommend a timeout of 1126 msec for satellite connections
>>>> (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
> 
> A small nit, they actually suggest 1128 ... and that's indeed what
> the patch has:
> 
>>>> +	unknown-server-time-limit: 1128
> 
> But that's trivial. The point:
> 
>> I am not entirely sure what this is supposed to fix.
> 
>> It is possible that a DNS response takes longer than 376ms, indeed.
>> Does it harm us if we send another packet? No.
> 
> If you are behind a slow satellite link, it can take more than that
> *every time*. So you would always have sent another query before
> getting a response to the previous one.

True, but aren’t these extraordinary circumstances?

On a regular network we want to keep eyeballs happy and when packets get lost or get sent to a slow server, we want to try again - sooner rather than later.

If we set this to a worst-case value (let’s say 10 seconds), then DNS resolution will become slower even for average users.

> With TCP that would mean never getting a response, because you'd
> always terminate the connection too soon. With UDP, I'm not sure,
> depends on how unbound handles incoming responses to queries it's
> already deemed lost and sent again. Adjusting delay-close might help.
> But it may be it would not work at all when the limit is too small.
> 
> That would mean that someone installing IPFire in some remote location
> with a slow link would conclude that it just doesn't work.
> 
> The downside of increasing the limit is that sometimes replies will
> take longer when a packet is lost on the way because we'd wait longer
> before re-sending. So it should not be increased too much either.
> 
> I don't have data to judge what the limit should be, but I'd tend to
> trust nllabs recommendation here and go with the suggested 1128 ms.

Did anyone actually experience some problems here that this needs changing?

@Jonatan: What is your motivation for this patch?

> 
> -- 
> Tapani Tarvainen
  
Jon Murphy Jan. 6, 2021, 6:59 p.m. UTC | #5
> On Jan 6, 2021, at 12:01 PM, Michael Tremer <michael.tremer@ipfire.org> wrote:
> 
> Did anyone actually experience some problems here that this needs changing?


Maybe here?

https://community.ipfire.org/t/override-disable-dnssec-system/2717
  
Michael Tremer Jan. 7, 2021, 11:27 a.m. UTC | #6
Hello Jon,

Yes, that could be true.

Can someone reach out to that user and see if they can apply the change and confirm that this works?

-Michael

> On 6 Jan 2021, at 18:59, Jon Murphy <jcmurphy26@gmail.com> wrote:
> 
> 
> 
>> On Jan 6, 2021, at 12:01 PM, Michael Tremer <michael.tremer@ipfire.org> wrote:
>> 
>> Did anyone actually experience some problems here that this needs changing?
> 
> 
> Maybe here?
> 
> https://community.ipfire.org/t/override-disable-dnssec-system/2717
> 
>
  
Tapani Tarvainen Jan. 7, 2021, 2:35 p.m. UTC | #7
Inasmuch as the need for this is likely to be rare and potentially at
least slightly harmful to normal users, perhaps it would be sufficient
to suggest in the documentation that people who need it simply add
their preferred unknown-server-time-limit setting to a file in
/etc/unbound/local.d?

It would be an easy way to test it, too.

Tapani

On Thu, Jan 07, 2021 at 11:27:43AM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
> 
> Hello Jon,
> 
> Yes, that could be true.
> 
> Can someone reach out to that user and see if they can apply the change and confirm that this works?
> 
> -Michael
> 
> > On 6 Jan 2021, at 18:59, Jon Murphy <jcmurphy26@gmail.com> wrote:
> > 
> > 
> > 
> >> On Jan 6, 2021, at 12:01 PM, Michael Tremer <michael.tremer@ipfire.org> wrote:
> >> 
> >> Did anyone actually experience some problems here that this needs changing?
> > 
> > 
> > Maybe here?
> > 
> > https://community.ipfire.org/t/override-disable-dnssec-system/2717
> > 
> >
  
Michael Tremer Jan. 7, 2021, 2:54 p.m. UTC | #8
Hello,

Yes that would be the easiest way to test this.

But in general I do not recommend keeping local changes like this permanently because they might break things.

-Michael

> On 7 Jan 2021, at 14:35, Tapani Tarvainen <ipfire@tapanitarvainen.fi> wrote:
> 
> Inasmuch as the need for this is likely to be rare and potentially at
> least slightly harmful to normal users, perhaps it would be sufficient
> to suggest in the documentation that people who need it simply add
> their preferred unknown-server-time-limit setting to a file in
> /etc/unbound/local.d?
> 
> It would be an easy way to test it, too.
> 
> Tapani
> 
> On Thu, Jan 07, 2021 at 11:27:43AM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
>> 
>> Hello Jon,
>> 
>> Yes, that could be true.
>> 
>> Can someone reach out to that user and see if they can apply the change and confirm that this works?
>> 
>> -Michael
>> 
>>> On 6 Jan 2021, at 18:59, Jon Murphy <jcmurphy26@gmail.com> wrote:
>>> 
>>> 
>>> 
>>>> On Jan 6, 2021, at 12:01 PM, Michael Tremer <michael.tremer@ipfire.org> wrote:
>>>> 
>>>> Did anyone actually experience some problems here that this needs changing?
>>> 
>>> 
>>> Maybe here?
>>> 
>>> https://community.ipfire.org/t/override-disable-dnssec-system/2717
>>> 
>>>
  
Paul Simmons Jan. 8, 2021, 8:25 a.m. UTC | #9
On 1/6/21 9:14 AM, Michael Tremer wrote:
> Hello,
>
>> On 6 Jan 2021, at 12:02, Paul Simmons <mbatranch@gmail.com> wrote:
>>
>> On 1/6/21 4:17 AM, Jonatan Schlag wrote:
>>> When unbound has no information about a DNS-server
>>> a timeout of 376 msec is assumed. This works well in a lot of situations,
>>> but they mention in their documentation that this could be way too low.
>>> They recommend a timeout of 1126 msec for satellite connections
>>> (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
>>> Settings this value to 1126 msec should make the first queries to an
>>> unknown server, more useful.
>>> They do not timeout and so these queries do not need to be sent again.
>>>
>>> On a stable link, this behaviour should not have negative implications.
>>> As the first result of queries arrive the timeout value gets updated,
>>> and the high value of 1126 msec gets set to something useful.
>>>
>>> Signed-off-by: Jonatan Schlag <jonatan.schlag@ipfire.org>
>>> ---
>>>   config/unbound/unbound.conf | 1 +
>>>   1 file changed, 1 insertion(+)
>>>
>>> diff --git a/config/unbound/unbound.conf b/config/unbound/unbound.conf
>>> index f78aaae8c..02f093015 100644
>>> --- a/config/unbound/unbound.conf
>>> +++ b/config/unbound/unbound.conf
>>> @@ -62,6 +62,7 @@ server:
>>>     	# Timeout behaviour
>>>   	infra-keep-probing: yes
>>> +	unknown-server-time-limit: 1128
>>>     	# Bootstrap root servers
>>>   	root-hints: "/etc/unbound/root.hints"
> I am not entirely sure what this is supposed to fix.
>
> It is possible that a DNS response takes longer than 376ms, indeed. Does it harm us if we send another packet? No.
>
> So what is this changing in real life?
>
>> This sounds promising to me, as I have many DNS lookup timeouts (ISP is HughesNot, er, HughesNet).
> @Paul: I am not sure if the solution is to increase timeouts. In my point of view, you should change the name servers.
>
>> +1
>>
>> Paul

Greetings, Michael.  The two DNS servers I use have ping times of 631ms 
(addr 9.9.9.10) and 742ms (addr 81.3.27.54).

I tested the ping times of the IPv4 addresses of the first 27 servers
listed in the wiki.

The times ranged from 596ms to 857ms, so I question if changing servers 
will afford any measurable relief.

Thank you,

Paul
  
Jonatan Schlag Jan. 8, 2021, 5:33 p.m. UTC | #10
Hi,

I will try to provide some explanations to the questions.

> Am 06.01.2021 um 19:01 schrieb Michael Tremer <michael.tremer@ipfire.org>:
> 
> Hello,
> 
>> On 6 Jan 2021, at 16:19, Tapani Tarvainen <ipfire@tapanitarvainen.fi> wrote:
>> 
>> On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
>> 
>>>> On 6 Jan 2021, at 12:02, Paul Simmons <mbatranch@gmail.com> wrote:
>>>> 
>>>> On 1/6/21 4:17 AM, Jonatan Schlag wrote:
>>>>> When unbound has no information about a DNS-server
>>>>> a timeout of 376 msec is assumed. This works well in a lot of situations,
>>>>> but they mention in their documentation that this could be way too low.
>>>>> They recommend a timeout of 1126 msec for satellite connections
>>>>> (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
>> 
>> A small nit, they actually suggest 1128 ... and that's indeed what
>> the patch has:
>> 
>>>>> +    unknown-server-time-limit: 1128
>> 
>> But that's trivial. The point:
>> 
>>> I am not entirely sure what this is supposed to fix.
>> 
>>> It is possible that a DNS response takes longer than 376ms, indeed.
>>> Does it harm us if we send another packet? No.
>> 
>> If you are behind a slow satellite link, it can take more than that
>> *every time*.
This should actually not be the case. There is no fixed timeout that can be set in unbound. They do something much more sophisticated here.

https://nlnetlabs.nl/documentation/unbound/info-timeout/

If I understand this document correctly, they keep something like a rolling mean. So if everybody ran 'unbound-control dump_infra', we would all get different timeout limits for every server and every site.
The actual calculation seems to be much more complex (or their explanation of simple things is very complex and lacks formulas); this is only a simplified explanation, which seems necessary for my next paragraph.

So the question is: when we have no information about a server (for example right after unbound starts, or when the entry in the infra cache has expired (time limit 15 min)), which timeout should we assume? We currently assume a timeout of 376 msec. They state in their documentation that on slow links 1128 msec is more suitable.

When we have information about a server (i.e. the RTT of previous requests), this value should not matter, if I understand this correctly.
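
A minimal sketch of that idea follows. It is not unbound's actual
algorithm: the class name, the smoothing weight and the safety margin
are assumptions made up for illustration. It only shows that the
unknown-server value is used while no RTT sample exists and stops
mattering once one does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeoutEstimate:
    """Toy per-server timeout based on a rolling mean of observed RTTs."""

    unknown_server_time_limit_ms: float = 376.0  # used while nothing is known
    weight: float = 0.125                        # hypothetical smoothing factor
    margin: float = 2.0                          # hypothetical safety multiplier
    smoothed_rtt_ms: Optional[float] = None      # None means "no information"

    def timeout_ms(self) -> float:
        """Return the timeout to use for the next query to this server."""
        if self.smoothed_rtt_ms is None:
            return self.unknown_server_time_limit_ms
        return self.margin * self.smoothed_rtt_ms

    def record_reply(self, rtt_ms: float) -> None:
        """Fold a measured round-trip time into the rolling mean."""
        if self.smoothed_rtt_ms is None:
            self.smoothed_rtt_ms = rtt_ms
        else:
            self.smoothed_rtt_ms += self.weight * (rtt_ms - self.smoothed_rtt_ms)

    def expire(self) -> None:
        """Forget the server, as when its infra cache entry expires."""
        self.smoothed_rtt_ms = None


default = TimeoutEstimate()
raised = TimeoutEstimate(unknown_server_time_limit_ms=1128.0)
print(default.timeout_ms(), raised.timeout_ms())  # differ: no information yet

for estimate in (default, raised):
    estimate.record_reply(631.0)                  # first reply over a slow link
print(default.timeout_ms() == raised.timeout_ms())  # True: limit no longer used
```

In this toy model the configured limit only comes back into play after
expire(), which corresponds to the "entry has expired" case above.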

>> So you would always have sent another query before
>> getting a response to the previous one.
> 
> True, but aren’t these extra-ordinary circumstances?
> 
>> On a regular network we want to keep eyeballs happy and when packets get lost or get sent to a slow server, we want to try again - sooner rather than later.
>> 
>> If we would set this to a worst case setting (let’s say 10 seconds), then even for average users DNS resolution will become slower.
>> 
>> With TCP that would mean never getting a response, because you'd
>> always terminate the connection too soon. With UDP, I'm not sure,
>> depends on how unbound handles incoming responses to queries it's
>> already deemed lost and sent again. Adjusting delay-close might help.
>> But it may be it would not work at all when the limit is too small.
>> 
>> That would mean that someone installing IPFire in some remote location
>> with a slow link would conclude that it just doesn't work.
>> 
>> The downside of increasing the limit is that sometimes replies will
>> take longer when a packet is lost on the way because we'd wait longer
>> before re-sending. So it should not be increased too much either.
This should only happen initially, while our own rolling mean has not yet adjusted to the needs of this site.
>> 
>> I don't have data to judge what the limit should be, but I'd tend to
>> trust nllabs recommendation here and go with the suggested 1128 ms.
> 
> Did anyone actually experience some problems here that this needs changing?
> 
> @Jonatan: What is your motivation for this patch?

Just opening the discussion. It seems that their handling of timeouts and the infra cache could have caused a lot of problems for some users, so I thought about bringing this up. Maybe it is a good idea for people like Paul to test this before we think further about how this could be implemented. Adding a note to the wiki that this might be a tweak to improve DNS resolution could also be a solution.
But people should first check the current infra cache, as these values determine whether this setting would help.

I hope I could make some things a little clearer.

Greetings Jonatan   
> 
>> 
>> -- 
>> Tapani Tarvainen
>
  
Michael Tremer Jan. 9, 2021, 3:04 p.m. UTC | #11
Hi,

In that case, I do not think that this change realistically changes anything for anyone.

In Paul’s case, where the name servers' round-trip time is longer than the timeout, he would send another packet, but then receive the first reply (disregarding any actual packet loss here), and after that unbound will have learned that the name server is further away.

He would have sent one extra packet. Potentially re-probing will cause the same effect, but usually unbound should be busy enough to have a rolling mean that is up to date at any time.
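
The "one extra packet" estimate can be checked with a small sketch. It
assumes no packet loss, a constant round-trip time, and that the timeout
doubles after every expiry. The doubling is an assumption for
illustration and is not taken from unbound's source; the function name
is hypothetical as well.

```python
def extra_packets(rtt_ms: float, initial_timeout_ms: float,
                  backoff: float = 2.0) -> int:
    """Count resends sent before the reply to the first packet arrives.

    Assumes no packet loss and that each expired timeout triggers exactly
    one resend, after which the timeout is multiplied by `backoff`.
    """
    elapsed = 0.0
    timeout = initial_timeout_ms
    resends = 0
    while elapsed + timeout < rtt_ms:
        elapsed += timeout   # the timer expires before the reply is back
        timeout *= backoff   # wait longer before the next resend
        resends += 1
    return resends


# Ping times reported earlier in this thread: 631 ms and 742 ms.
for rtt in (631, 742):
    print(rtt, extra_packets(rtt, 376), extra_packets(rtt, 1128))
```

With these inputs the sketch gives one resend at 376 ms and none at
1128 ms, for both servers.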

Therefore this only matters in recursor mode where there are many servers being contacted instead of only a few forwarders. Again, there would be more overhead here, but there should not be any effect where names cannot be resolved.

We can now increase the timeout, which will cause slower resolution for many users that are running in recursor mode, or we can just leave it and nothing would change.

-Michael

> On 8 Jan 2021, at 17:33, Jonatan Schlag <jonatan.schlag@ipfire.org> wrote:
> 
> Hi,
> 
> I will try to provide some explanations to the questions.
> 
>> Am 06.01.2021 um 19:01 schrieb Michael Tremer <michael.tremer@ipfire.org>:
>> 
>> Hello,
>> 
>>> On 6 Jan 2021, at 16:19, Tapani Tarvainen <ipfire@tapanitarvainen.fi> wrote:
>>> 
>>> On Wed, Jan 06, 2021 at 03:14:52PM +0000, Michael Tremer (michael.tremer@ipfire.org) wrote:
>>> 
>>>>> On 6 Jan 2021, at 12:02, Paul Simmons <mbatranch@gmail.com> wrote:
>>>>> 
>>>>> On 1/6/21 4:17 AM, Jonatan Schlag wrote:
>>>>>> When unbound has no information about a DNS-server
>>>>>> a timeout of 376 msec is assumed. This works well in a lot of situations,
>>>>>> but they mention in their documentation that this could be way too low.
>>>>>> They recommend a timeout of 1126 msec for satellite connections
>>>>>> (https://nlnetlabs.nl/documentation/unbound/unbound.conf).
>>> 
>>> A small nit, they actually suggest 1128 ... and that's indeed what
>>> the patch has:
>>> 
>>>>>> +    unknown-server-time-limit: 1128
>>> 
>>> But that's trivial. The point:
>>> 
>>>> I am not entirely sure what this is supposed to fix.
>>> 
>>>> It is possible that a DNS response takes longer than 376ms, indeed.
>>>> Does it harm us if we send another packet? No.
>>> 
>>> If you are behind a slow satellite link, it can take more than that
>>> *every time*. 
> This should actually not be the case. Unbound has no single fixed timeout that can be set; it does something much more sophisticated here.
> 
> https://nlnetlabs.nl/documentation/unbound/info-timeout/
> 
> If I understand this document correctly, they keep something like a rolling mean. So if everybody ran 'unbound-control dump_infra', we would all get different timeout limits for every server and every site.
> The actual calculation seems to be much more complex (or their explanation of simple things is very complex and lacks formulas); this is only a simplified explanation, which seems necessary for my next paragraph.
> 
> So the question is: when we have no information about a server (for example right after unbound starts, or when the entry in the infra cache has expired (time limit 15 min)), which timeout should we assume? We currently assume a timeout of 376 msec. They state in their documentation that 1128 msec is more suitable on slow links.
> 
> When we have information about a server (that is, the RTT of previous requests), this value should not matter, if I understand this right.
> 
>>> So you would always have sent another query before
>>> getting a response to the previous one.
>> 
>> True, but aren’t these extra-ordinary circumstances?
>> 
>> On a regular network we want to keep eyeballs happy and when packets get lost or get sent to a slow server, we want to try again - sooner rather than later.
>> 
>> If we would set this to a worst case setting (let’s say 10 seconds), then even for average users DNS resolution will become slower.
>> 
>>> With TCP that would mean never getting a response, because you'd
>>> always terminate the connection too soon. With UDP, I'm not sure,
>>> depends on how unbound handles incoming responses to queries it's
>>> already deemed lost and sent again. Adjusting delay-close might help.
>>> But it may be it would not work at all when the limit is too small.
>>> 
>>> That would mean that someone installing IPFire in some remote location
>>> with a slow link would conclude that it just doesn't work.
>>> 
>>> The downside of increasing the limit is that sometimes replies will
>>> take longer when a packet is lost on the way because we'd wait longer
>>> before re-sending. So it should not be increased too much either.
> This should only happen the first time, when our own rolling mean has not yet adjusted to the needs of this site.
>>> 
>>> I don't have data to judge what the limit should be, but I'd tend to
>>> trust NLnet Labs' recommendation here and go with the suggested 1128 ms.
>> 
>> Did anyone actually experience some problems here that this needs changing?
>> 
>> @Jonatan: What is your motivation for this patch?
> 
> Just opening the discussion. It seems that their handling of timeouts and the infra cache could have caused a lot of problems for some users, so I thought about bringing this up. Maybe it is a good idea for people like Paul to test this before we think further about how this could be implemented. Adding this to the wiki as a tweak that might improve DNS resolution could also be a solution.
> But people should first check the current infra cache, as these values would determine whether this setting would help.
> 
> I hope I could make some things a little bit clearer.
> 
> Greetings Jonatan   
>> 
>>> 
>>> -- 
>>> Tapani Tarvainen
  
Paul Simmons Jan. 9, 2021, 6:57 p.m. UTC | #12
On 1/9/21 9:04 AM, Michael Tremer wrote:
> Hi,
>
> In that case, I do not think that this change realistically changes anything for anyone.
>
> In Paul’s case, where the name servers are further away than the timeout, he would send another packet, but then receive the first reply (not regarding any actual packet loss here), and after that unbound will have learned that the name server is further away.
>
> He would have sent one extra packet. Potentially re-probing will cause the same effect, but usually unbound should be busy enough to have a rolling mean that is up to date at any time.
>
> Therefore this only matters in recursor mode where there are many servers being contacted instead of only a few forwarders. Again, there would be more overhead here, but there should not be any effect where names cannot be resolved.
>
> We can now increase the timeout, which will cause slower resolution for many users that are running in recursor mode, or we can just leave it and nothing would change.
>
> -Michael
>

Greetings, Michael and @list.

I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS 
server list from the wiki.  I can test more, if desired.

The fastest return was 596ms, and the slowest was 857ms.  At present, 
I'm using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).

My DNS protocol is "TLS", and QNAME Minimisation is "Standard". Prior to 
the release with TLS support, I was unable to resolve hosts at all.  
(Did I mention that I dislike HughesNot?  I have no other option for 
'net connectivity - boonie life is great for the nerves, but hell on 
talking to anyone.)

I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it 
will clarify the situation.  Also, I'm prepared to back up and edit any 
other files that might assist testing.

I've noticed (from NTP logs) that name resolution usually stalls/fails 
after ~3 hours when my LAN is quiet.  Could changes to cache timeout 
settings be beneficial?

Please advise...

Thank you (and, GREAT EFFORT, ALL!),

Paul
  
Tapani Tarvainen Jan. 10, 2021, 2:07 p.m. UTC | #13
On Sat, Jan 09, 2021 at 12:57:44PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:

> I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS
> server list from the wiki.  I can test more, if desired.
> 
> The fastest return was 596ms, and the slowest was 857ms.  At present, I'm
> using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).

Wow. That *is* slow.

> I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if
> it will clarify the situation.

I think it would be very useful if you could test if changing the
limits actually helps in your situation.

It's easy enough to do: e.g.,

echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts

and restart unbound and see if it makes a difference for you.

You might also check whether the non-TLS settings (TCP or UDP) work after that.
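To judge whether the setting matters on a given system, the infra cache mentioned earlier in the thread can be inspected with 'unbound-control dump_infra' before and after the change. The following Python sketch extracts the per-server values from one line of that output. The sample line is hypothetical and only illustrates the assumed layout (address, zone, then keyword/number pairs such as `rto 376`); the real output may differ between unbound versions, and `parse_infra_line` is a name made up for this example.

```python
def parse_infra_line(line):
    """Parse one line of assumed 'unbound-control dump_infra' output.

    Returns (address, zone, fields), where fields maps keywords such as
    'rtt' or 'rto' to integers, or None if the line is too short.
    """
    tokens = line.split()
    if len(tokens) < 2:
        return None
    address, zone, rest = tokens[0], tokens[1], tokens[2:]
    fields = {}
    # Walk over neighbouring tokens and keep every "keyword number" pair.
    # Keywords without a value are skipped automatically.
    for key, value in zip(rest, rest[1:]):
        key_is_number = key.lstrip("-").isdigit()
        value_is_number = value.lstrip("-").isdigit()
        if value_is_number and not key_is_number:
            fields[key] = int(value)
    return address, zone, fields


# Hypothetical sample line, not taken from a real system.
sample = "192.0.2.53 example.org. ttl 885 ping 0 var 94 rtt 376 rto 376"
address, zone, fields = parse_infra_line(sample)
print(address, zone, fields["rto"])
```

If most entries already show a learned timeout well above 376 ms, the initial value is only relevant for servers that have not been contacted yet or whose entry has expired.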
  
Michael Tremer Jan. 11, 2021, 11:10 a.m. UTC | #14
> On 9 Jan 2021, at 18:57, Paul Simmons <mbatranch@gmail.com> wrote:
> 
> On 1/9/21 9:04 AM, Michael Tremer wrote:
>> Hi,
>> 
>> In that case, I do not think that this change realistically changes anything for anyone.
>> 
>> In Paul’s case, where the name servers are further away than the timeout, he would send another packet, but then receive the first reply (not regarding any actual packet loss here), and after that unbound will have learned that the name server is further away.
>> 
>> He would have sent one extra packet. Potentially re-probing will cause the same effect, but usually unbound should be busy enough to have a rolling mean that is up to date at any time.
>> 
>> Therefore this only matters in recursor mode where there are many servers being contacted instead of only a few forwarders. Again, there would be more overhead here, but there should not be any effect where names cannot be resolved.
>> 
>> We can now increase the timeout, which will cause slower resolution for many users that are running in recursor mode, or we can just leave it and nothing would change.
>> 
>> -Michael
>> 
> 
> Greetings, Michael and @list.
> 
> I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS server list from the wiki.  I can test more, if desired.
> 
> The fastest return was 596ms, and the slowest was 857ms.  At present, I'm using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
> 
> My DNS protocol is "TLS", and QNAME Minimisation is "Standard". Prior to the release with TLS support, I was unable to resolve hosts at all.  (Did I mention that I dislike HughesNot?  I have no other option for 'net connectivity - boonie life is great for the nerves, but hell on talking to anyone.)

The good thing, though, is that we now have a good test-bed for this kind of connection :)

I know of some more people who use a satellite connection, but they are not very keen on testing things with it.

> I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it will clarify the situation.  Also, I'm prepared to backup and edit any other files that might assist testing.
> 
> I've noticed (from NTP logs) that name resolution usually stalls/fails after ~3 hours when my LAN is quiet.  Could changes to cache timeout settings be beneficial?
> 
> Please advise...
> 
> Thank you (and, GREAT EFFORT, ALL!),
> 
> Paul
> 
> -- 
> It is better to have loved a short man than never to have loved a tall.
>
  
Paul Simmons Jan. 12, 2021, 4:37 a.m. UTC | #15
On 1/11/21 5:10 AM, Michael Tremer wrote:
>
>> On 9 Jan 2021, at 18:57, Paul Simmons <mbatranch@gmail.com> wrote:
>>
>> On 1/9/21 9:04 AM, Michael Tremer wrote:
>>> Hi,
>>>
>>> In that case, I do not think that this change realistically changes anything for anyone.
>>>
>>> In Paul’s case, where the name servers are further away than the timeout, he would send another packet, but then receive the first reply (not regarding any actual packet loss here), and after that unbound will have learned that the name server is further away.
>>>
>>> He would have sent one extra packet. Potentially re-probing will cause the same effect, but usually unbound should be busy enough to have a rolling mean that is up to date at any time.
>>>
>>> Therefore this only matters in recursor mode where there are many servers being contacted instead of only a few forwarders. Again, there would be more overhead here, but there should not be any effect where names cannot be resolved.
>>>
>>> We can now increase the timeout, which will cause slower resolution for many users that are running in recursor mode, or we can just leave it and nothing would change.
>>>
>>> -Michael
>>>
>> Greetings, Michael and @list.
>>
>> I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS server list from the wiki.  I can test more, if desired.
>>
>> The fastest return was 596ms, and the slowest was 857ms.  At present, I'm using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
>>
>> My DNS protocol is "TLS", and QNAME Minimisation is "Standard". Prior to the release with TLS support, I was unable to resolve hosts at all.  (Did I mention that I dislike HughesNot?  I have no other option for 'net connectivity - boonie life is great for the nerves, but hell on talking to anyone.)
> The good thing is though, that we have a good test-bed for this kind of connection :)
>
> I know of some more people who use a satellite connection, but they are not very keen on testing things with it.
>
>> I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if it will clarify the situation.  Also, I'm prepared to backup and edit any other files that might assist testing.
>>
>> I've noticed (from NTP logs) that name resolution usually stalls/fails after ~3 hours when my LAN is quiet.  Could changes to cache timeout settings be beneficial?
>>
>> Please advise...
>>
>> Thank you (and, GREAT EFFORT, ALL!),
>>
>> Paul
>>
>> -- 
>> It is better to have loved a short man than never to have loved a tall.
>>
I'm pleased to be able to help, and grateful for the attention and 
assistance.  See my next msg for testing update.

p.
  
Paul Simmons Jan. 12, 2021, 5:07 a.m. UTC | #16
On 1/10/21 8:07 AM, Tapani Tarvainen wrote:
> On Sat, Jan 09, 2021 at 12:57:44PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
>
>> I tested the ping (-c1) times for the first 27 IPv4 addresses in the DNS
>> server list from the wiki.  I can test more, if desired.
>>
>> The fastest return was 596ms, and the slowest was 857ms.  At present, I'm
>> using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
> Wow. That *is* slow.
>
>> I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if
>> it will clarify the situation.
> I think it would be very useful if you could test if changing the
> limits actually helps in your situation.
>
> It's easy enough to do: e.g.,
>
> echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
>
> and restart unbound and see if it makes a difference for you.
>
> You might also try if non-TLS settings (TCP or UDP) work after that.
>
Hello, I have some results.

The /etc/unbound/local.d/timeouts change (plus an unbound restart) did not 
completely resolve the NTP-related lookup failures.  It "seemed" to prevent 
complete failure, but the first of two lookups, to different pool aliases, did fail.

I retained the "timeouts" and changed from TLS to TCP, and haven't seen 
any lookup failures.

Tomorrow, I will experiment using "timeouts" and UDP.  After a day or 
so, I'll try removing the "timeouts" and repeat the TCP and UDP tests.

Thank you!

p.
  
Paul Simmons Jan. 16, 2021, 3:02 a.m. UTC | #17
On 1/11/21 11:07 PM, Paul Simmons wrote:
> On 1/10/21 8:07 AM, Tapani Tarvainen wrote:
>> On Sat, Jan 09, 2021 at 12:57:44PM -0600, Paul Simmons 
>> (mbatranch@gmail.com) wrote:
>>
>>> I tested the ping (-c1) times for the first 27 IPv4 addresses in the 
>>> DNS
>>> server list from the wiki.  I can test more, if desired.
>>>
>>> The fastest return was 596ms, and the slowest was 857ms.  At 
>>> present, I'm
>>> using 9.9.9.10 (631ms ping) and 81.3.27.54 (752ms ping).
>> Wow. That *is* slow.
>>
>>> I'm willing to test Tapani's "/etc/unbound/local.d" proposal(s), if
>>> it will clarify the situation.
>> I think it would be very useful if you could test if changing the
>> limits actually helps in your situation.
>>
>> It's easy enough to do: e.g.,
>>
>> echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
>>
>> and restart unbound and see if it makes a difference for you.
>>
>> You might also try if non-TLS settings (TCP or UDP) work after that.
>>
> Hello, I have some results.
>
> The /etc/unbound/local.d/timeouts (+unbound restart) did not 
> completely resolve NTP related lookup failures.  It "seemed" to 
> prevent complete failure, but the first of two lookups, to different 
> pool aliases, did fail.
>
> I retained the "timeouts" and changed from TLS to TCP, and haven't 
> seen any lookup failures.
>
> Tomorrow, I will experiment using "timeouts" and UDP.  After a day or 
> so, I'll try removing the "timeouts" and repeat the TCP and UDP tests.
>
> Thank you!
>
> p.
>
I've found that UDP doesn't work at all.  TCP with "timeout" mod never 
fails.

Will now test TCP without "timeout" mod.

Paul
  
Tapani Tarvainen Jan. 16, 2021, 8:13 a.m. UTC | #18
On Fri, Jan 15, 2021 at 09:02:08PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:

> > > echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts

> I've found that UDP doesn't work at all.  TCP with "timeout" mod never
> fails.

You might also try if UDP works with

delay-close: 1500

instead of or in addition to the unknown-server-time-limit.
  
Paul Simmons Jan. 19, 2021, 6:22 a.m. UTC | #19
On 1/16/21 2:13 AM, Tapani Tarvainen wrote:
> On Fri, Jan 15, 2021 at 09:02:08PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
>
>>>> echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
>> I've found that UDP doesn't work at all.  TCP with "timeout" mod never
>> fails.
> You might also try if UDP works with
>
> delay-close: 1500
>
> instead of or in addition to the unknown-server-time-limit.
>
Howdy!

I tried UDP with both mods ('unknown-server-time-limit: 1128' && 
'delay-close: 1500').  Unfortunately, I experienced intermittent 
resolution errors.

Am now using TCP...  no apparent errors, but resolution is SssLllOooWww, 
just as before.
(total.recursion.time.avg=4.433958 total.recursion.time.median=3.65429 
total.num.recursivereplies=1515)
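For comparison with the ping times reported earlier, the statistics line above can be converted to milliseconds. This is a small Python sketch that only parses the `key=value` pairs exactly as pasted in this message; it assumes the time values are in seconds, and `parse_stats` is a name chosen for this example.

```python
def parse_stats(text):
    """Turn 'key=value' tokens into a dict of floats, ignoring parentheses."""
    stats = {}
    for token in text.replace("(", " ").replace(")", " ").split():
        key, sep, value = token.partition("=")
        if sep:
            stats[key] = float(value)
    return stats


line = ("(total.recursion.time.avg=4.433958 "
        "total.recursion.time.median=3.65429 "
        "total.num.recursivereplies=1515)")
stats = parse_stats(line)

# Assumption: the time values are reported in seconds.
avg_ms = stats["total.recursion.time.avg"] * 1000
median_ms = stats["total.recursion.time.median"] * 1000
print(f"avg {avg_ms:.0f} ms, median {median_ms:.0f} ms, "
      f"replies {int(stats['total.num.recursivereplies'])}")
```

With ping times of roughly 600 to 860 ms per round trip, a median of several seconds corresponds to a handful of round trips per recursive lookup.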

Thank you for your efforts.  Latency on "HughesNot" is insurmountable, 
but (barely) beats no connectivity.  I hope to try Starlink, if/when it 
becomes available for my latitude (30.9 North).

Paul
  
Michael Tremer Jan. 25, 2021, 7:23 p.m. UTC | #20
Hello everyone,

So what does that leave us with?

Should we drop the patch because it does not change anything and the correct solution would be to use TCP as the underlying protocol?

-Michael

> On 19 Jan 2021, at 06:22, Paul Simmons <mbatranch@gmail.com> wrote:
> 
> On 1/16/21 2:13 AM, Tapani Tarvainen wrote:
>> On Fri, Jan 15, 2021 at 09:02:08PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
>> 
>>>>> echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
>>> I've found that UDP doesn't work at all.  TCP with "timeout" mod never
>>> fails.
>> You might also try if UDP works with
>> 
>> delay-close: 1500
>> 
>> instead of or in addition to the unknown-server-time-limit.
>> 
> Howdy!
> 
> I tried UDP with both mods ('unknown-server-time-limit: 1128' && 'delay-close: 1500').  Unfortunately, I experienced intermittent resolution errors.
> 
> Am now using TCP...  no apparent errors, but resolution is SssLllOooWww, just as before.
> (total.recursion.time.avg=4.433958 total.recursion.time.median=3.65429 total.num.recursivereplies=1515)
> 
> Thank you for your efforts.  Latency on "HughesNot" is insurmountable, but (barely) beats no connectivity.  I hope to try Starlink, if/when it becomes available for my latitude (30.9 North).
> 
> Paul
> 
> -- 
> It is hard for an empty bag to stand upright.  -- Benjamin Franklin, 1757
>
  
Paul Simmons Jan. 25, 2021, 8:29 p.m. UTC | #21
On 1/25/21 1:23 PM, Michael Tremer wrote:
> Hello everyone,
>
> So what does that leave us with?
>
> Should we drop the patch because it does not change anything and the correct solution would be using TCP as the underlying protocol?
>
> -Michael
>
>> On 19 Jan 2021, at 06:22, Paul Simmons <mbatranch@gmail.com> wrote:
>>
>> On 1/16/21 2:13 AM, Tapani Tarvainen wrote:
>>> On Fri, Jan 15, 2021 at 09:02:08PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
>>>
>>>>>> echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
>>>> I've found that UDP doesn't work at all.  TCP with "timeout" mod never
>>>> fails.
>>> You might also try if UDP works with
>>>
>>> delay-close: 1500
>>>
>>> instead of or in addition to the unknown-server-time-limit.
>>>
>> Howdy!
>>
>> I tried UDP with both mods ('unknown-server-time-limit: 1128' && 'delay-close: 1500').  Unfortunately, I experienced intermittent resolution errors.
>>
>> Am now using TCP...  no apparent errors, but resolution is SssLllOooWww, just as before.
>> (total.recursion.time.avg=4.433958 total.recursion.time.median=3.65429 total.num.recursivereplies=1515)
>>
>> Thank you for your efforts.  Latency on "HughesNot" is insurmountable, but (barely) beats no connectivity.  I hope to try Starlink, if/when it becomes available for my latitude (30.9 North).
>>
>> Paul
>>
>> -- 
>> It is hard for an empty bag to stand upright.  -- Benjamin Franklin, 1757
>>
I haven't studied the metrics from unbound, so can't say if the modified 
timeouts help to avoid retransmissions.

As of this moment, TCP works, albeit slowly.  If you'd rather drop the 
patch, I'm okay with that.

Thanks for all the effort!

Paul
  
Michael Tremer Jan. 25, 2021, 8:50 p.m. UTC | #22
Hi,

> On 25 Jan 2021, at 20:29, Paul Simmons <mbatranch@gmail.com> wrote:
> 
> On 1/25/21 1:23 PM, Michael Tremer wrote:
>> Hello everyone,
>> 
>> So what does that leave us with?
>> 
>> Should we drop the patch because it does not change anything and the correct solution would be using TCP as the underlying protocol?
>> 
>> -Michael
>> 
>>> On 19 Jan 2021, at 06:22, Paul Simmons <mbatranch@gmail.com> wrote:
>>> 
>>> On 1/16/21 2:13 AM, Tapani Tarvainen wrote:
>>>> On Fri, Jan 15, 2021 at 09:02:08PM -0600, Paul Simmons (mbatranch@gmail.com) wrote:
>>>> 
>>>>>>> echo 'unknown-server-time-limit: 1128' >/etc/unbound/local.d/timeouts
>>>>> I've found that UDP doesn't work at all.  TCP with "timeout" mod never
>>>>> fails.
>>>> You might also try if UDP works with
>>>> 
>>>> delay-close: 1500
>>>> 
>>>> instead of or in addition to the unknown-server-time-limit.
>>>> 
>>> Howdy!
>>> 
>>> I tried UDP with both mods ('unknown-server-time-limit: 1128' && 'delay-close: 1500').  Unfortunately, I experienced intermittent resolution errors.
>>> 
>>> Am now using TCP...  no apparent errors, but resolution is SssLllOooWww, just as before.
>>> (total.recursion.time.avg=4.433958 total.recursion.time.median=3.65429 total.num.recursivereplies=1515)
>>> 
>>> Thank you for your efforts.  Latency on "HughesNot" is insurmountable, but (barely) beats no connectivity.  I hope to try Starlink, if/when it becomes available for my latitude (30.9 North).
>>> 
>>> Paul
>>> 
>>> -- 
>>> It is hard for an empty bag to stand upright.  -- Benjamin Franklin, 1757
>>> 
> I haven't studied the metrics from unbound, so can't say if the modified timeouts help to avoid retransmissions.
> 
> As of this moment, TCP works, albeit slowly.  If you'd rather drop the patch, I'm okay with that.

Yes, TCP should always work and it will be much faster with Core Update 154 since the connections remain open.
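
As an illustration only: in plain unbound, sending upstream queries over TCP corresponds to the `tcp-upstream` option shown below. How IPFire switches the protocol, and what Core Update 154 changes to keep connections open, is not shown in this thread, so treat this as a sketch of the underlying setting rather than the actual IPFire configuration.

```
# Sketch only: the generic unbound option, not necessarily what IPFire generates.
server:
	# Use TCP instead of UDP for queries to upstream servers
	tcp-upstream: yes
```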

We can always come back to this thread if there is a reason to in the future.

> Thanks for all the effort!

Thank you very much for your testing, too!

Best,
-Michael

> Paul
  

Patch

diff --git a/config/unbound/unbound.conf b/config/unbound/unbound.conf
index f78aaae8c..02f093015 100644
--- a/config/unbound/unbound.conf
+++ b/config/unbound/unbound.conf
@@ -62,6 +62,7 @@  server:
 
 	# Timeout behaviour
 	infra-keep-probing: yes
+	unknown-server-time-limit: 1128
 
 	# Bootstrap root servers
 	root-hints: "/etc/unbound/root.hints"