Keep trying DHCP on red0 if unavailable at boot

Message ID 4861861e-1017-1596-e5bb-88f80c967971@jacknife.org
State Not Applicable

Commit Message

Brad Spencer Jan. 9, 2022, 1:27 a.m. UTC
  I had a failure today on my new ipfire installation that didn't survive 
the same kind of outage that my hand-made Debian-based firewall box had 
survived many times in the past: a power failure and restoration.

In the community pages, I picked up an existing discussion and discussed 
the scenario in detail.  I won't repeat that discussion here, but it was 
suggested that I post to this list and work towards an upstream change.

https://community.ipfire.org/t/dhcp-client-on-red0-wont-reassign-ip-upon-reconnection/2455/26?u=spencer

I won't repeat all the details of how I discovered this, but allow me to 
summarize the small changes I made.  (See the community post and 
followups there for full details.)

Basically, when ipfire boots and DHCP on red doesn't provide an address, 
dhcpcd times out after 60 seconds and then stops trying and nothing 
makes it try again.  This leaves the green network up (good!) but the 
red network completely dead until someone reboots ipfire (or takes some 
other steps that re-trigger a start of dhcpcd).

My simple repair so far has been:

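1. Edit /etc/init.d/networking/functions.network to start dhcpcd in the 
background with no timeout.

--- /root/functions.network.orig        2022-01-08 16:26:02.956856033 -0400
+++ functions.network   2022-01-08 21:07:28.617170885 -0400
@@ -56,7 +56,7 @@
         # This function will start a dhcpcd on a speciefied device.

         local device="$1"
-       local dhcp_start=""
+       local dhcp_start="--timeout 0 --background "

         boot_mesg -n "Starting dhcpcd on the ${device} interface..."

(Be sure to include that last space inside the quotes!)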

2. For my testing, I also set ntp's ENABLESETONBOOT in 
/var/ipfire/time/settings to off (aka “Force setting the system clock on 
boot”) because it sits in a loop waiting for red0 to come up otherwise!

At the time, I didn't notice that the loop in /etc/init.d/ntp stops 
after a minute, but nonetheless, it was handy to turn it off while 
testing :)  So, I _think_ only the first change is necessary.
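
The bounded wait described above can be sketched roughly as follows. This is not the code from /etc/init.d/ntp; the helper name wait_for_flag, the one-second poll interval, and the idea of a flag file such as /var/ipfire/red/active signalling that red is up are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: poll for a flag file, giving up after a fixed number of seconds.
# wait_for_flag is a hypothetical helper, not part of ipfire.

wait_for_flag() {
        flag="$1"        # file whose existence signals "red is up" (assumed)
        timeout="$2"     # maximum number of seconds to wait

        waited=0
        while [ ! -e "${flag}" ]; do
                if [ "${waited}" -ge "${timeout}" ]; then
                        return 1    # gave up; the caller carries on without red
                fi
                sleep 1
                waited=$((waited + 1))
        done
        return 0
}

# Example: wait at most 60 seconds, then continue booting either way.
# wait_for_flag /var/ipfire/red/active 60 || echo "red0 is not up yet"
```

Because the wait is bounded, a missing lease delays whatever calls it by at most the timeout rather than blocking it forever.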

All of the testing I did so far seems to indicate that, provided I don't 
have rules that explicitly mention the red0 IP address, all works well 
when the lease is acquired, or even when the lease changes the IP 
unexpectedly.

Is a change like this something that could become part of ipfire?

Thanks for making ipfire!  I'm impressed so far.
  

Comments

Jose A. Dias Jan. 9, 2022, 2:03 a.m. UTC | #1
I've had dhcp do that to me and it doesn't take a power failure but a downed isp will do.

My solution is to have a script run every 5 minutes to try again and after an hour reboot ipfire again.

My desktop just died and I'm rebuilding it but I can post that script if it's helpful.
  
Brad Spencer Jan. 9, 2022, 3:04 a.m. UTC | #2
On 1/8/2022 10:03 PM, Jose Dias wrote:
> I've had dhcp do that to me and it doesn't take a power failure but a 
> downed isp will do.
>
> My solution is to have a script run every 5 minutes to try again and 
> after an hour reboot ipfire again.
>
> My desktop just died and I'm rebuilding it but I can post that script 
> if it's helpful.

Yes, I agree.  In the community post I linked to, I tried to explain 
the failure in detail.

With my change to its startup arguments, I've been able to demonstrate 
that even if dhcpcd is unable to obtain an initial DHCP lease at boot, 
ipfire's boot sequence is not blocked, and dhcpcd does correctly keep 
retrying for longer than 60 seconds, and that ipfire reacts correctly 
when an IP is leased.  See

If you're interested, you can see my most recent followup in the 
community: 
https://community.ipfire.org/t/dhcp-client-on-red0-wont-reassign-ip-upon-reconnection/2455/38?u=spencer

So, I'm going to try this instead of rebooting.  The retries are 
frequent and cheap, and other (local) ipfire services remain available 
the whole time.
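
One way to watch this happen from a shell on the firewall is sketched below. The helper name iface_ipv4 is made up for this example; it relies only on "ip -4 address show" from iproute2 and prints nothing until the interface has an IPv4 address.

```shell
#!/bin/sh
# Sketch only: print the IPv4 address(es) currently assigned to an interface,
# including the prefix length (for example 203.0.113.7/24).
# iface_ipv4 is a hypothetical helper name.

iface_ipv4() {
        dev="${1:-red0}"
        # "ip -4" limits the output to IPv4, and the awk match on the first
        # field skips every line that is not an "inet" address line.
        ip -4 address show dev "${dev}" 2>/dev/null | awk '$1 == "inet" { print $2 }'
}

# Example: check every 10 seconds until dhcpcd has obtained a lease.
# until [ -n "$(iface_ipv4 red0)" ]; do sleep 10; done
# logger -t ipfire "red0 now has $(iface_ipv4 red0)"
```

Polling like this only observes the interface; it does not restart dhcpcd or touch the red network.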

Thanks for the offer!
  
Jonatan Schlag Jan. 9, 2022, 3:32 p.m. UTC | #3
Hi,

thanks for getting in touch. The behaviour you described is tracked in bug #10813.
I am currently working on a clean solution for this and for bug #11502, but this takes time. The scripts involved are rather old (2007), which makes the fix more of a rewrite.

You can see my progress here: 
https://git.ipfire.org/?p=people/jschlag/ipfire-2.x.git;a=shortlog;h=refs/heads/improve_network_startup 

As this needs very thorough testing, I will reach out to the list when I have an ISO image with all the necessary changes.

Greetings Jonatan
  
Jose A. Dias Jan. 9, 2022, 9:02 p.m. UTC | #4
This is what I use. It actually runs out of /etc/fcron.hourly. It keeps trying to get an IP, and if it still doesn't have one after 15 cycles, it reboots. Rebooting has no impact on DHCP on green, as the lease times will withstand a reboot.

[root@harold fcron.hourly]# cat /etc/fcron.hourly/monitor_red_device.sh
#!/bin/sh

# Watch the external (red) device. If it has no IP address, restart the
# red network up to CYCLES times, then reboot as a last resort.

# set -vx

SLEEP=30s
DEV=red0

CYCLES=15

IPLOST=0

for c in $(seq 1 ${CYCLES}) ; do

        # get IPv4 address from external device, default red0
        # (match "inet " with a trailing space so inet6 lines are ignored)
        IP=$(ip address show dev ${DEV} | grep 'inet ' | awk '{print $2}')

        echo IP=${IP}

        # if we got an ip then exit
        if [ -n "${IP}" ] ; then
                if [ ${IPLOST} -eq 0 ] ; then
                        logger -t ipfire "IP= ${IP}"
                else
                        logger -t ipfire "IP= ${IP} reacquired."
                fi
                exit 0
        fi

        IPLOST=1
        logger -t ipfire 'IP lost'
        # we don't have an IP address. wait, then restart the red network
        sleep ${SLEEP}

        /etc/init.d/networking/red stop ${DEV}

        sleep ${SLEEP}

        /etc/init.d/networking/red start ${DEV}
        sleep ${SLEEP}

done

logger -t ipfire 'IP not acquired. Rebooting.'
# if we reach this far then we might as well restart
telinit 6
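
As a rough check on the timing: each pass through the loop in the script above sleeps three times, so the longest it can run before rebooting is about CYCLES x 3 x SLEEP seconds. The sketch below just does that arithmetic; worst_case_seconds is a name invented for this example, and it assumes the red stop and start commands themselves take negligible time, which may not hold while dhcpcd is waiting for a lease.

```shell
#!/bin/sh
# Sketch only: worst-case run time of the monitor loop before it reboots.
# Assumes three sleeps per cycle (as in the script above) and ignores the
# time taken by the red stop/start commands themselves.

worst_case_seconds() {
        cycles="$1"
        sleep_seconds="$2"
        echo $((cycles * 3 * sleep_seconds))
}

worst_case_seconds 15 30    # 1350 seconds, i.e. 22.5 minutes
```

With these numbers the script finishes well inside the hourly fcron interval, so two runs should not overlap under the stated assumption.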


  
Jose A. Dias Jan. 14, 2022, 12:03 a.m. UTC | #5
Well, Rogers (my ISP) does not disappoint in this regard. They disappoint in
other ways, but downtime they do produce.

This is what the logs show for today. This is from the IPFire section of the
System Logs. I lost the IP at about 10:30 local time. I did some
maintenance while it was out and then I let it cycle through.  The only
change I've made to the script I posted earlier was to use CYCLES=18 to let
it run a couple more times in the hour. This is rough but it gets around the
bug for now.
  

Patch

--- /root/functions.network.orig        2022-01-08 16:26:02.956856033 -0400
+++ functions.network   2022-01-08 21:07:28.617170885 -0400
@@ -56,7 +56,7 @@ 
         # This function will start a dhcpcd on a speciefied device.

         local device="$1"
-       local dhcp_start=""
+       local dhcp_start="--timeout 0 --background "

         boot_mesg -n "Starting dhcpcd on the ${device} interface..."
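
The trailing space inside the quotes matters if later code appends further options directly onto dhcp_start, which is what the sketch below assumes. It is not the real functions.network: build_dhcpcd_args and its hostname parameter are hypothetical and only illustrate the string concatenation.

```shell
#!/bin/sh
# Sketch only: why the initial value of dhcp_start needs a trailing space.
# build_dhcpcd_args is a hypothetical stand-in; it assumes that further
# options are appended to the string without adding a separator first.

build_dhcpcd_args() {
        dhcp_start="$1"     # initial options, e.g. "--timeout 0 --background "
        hostname="$2"       # optional DHCP hostname

        if [ -n "${hostname}" ]; then
                dhcp_start="${dhcp_start}-h ${hostname} "
        fi
        printf '%s\n' "${dhcp_start}"
}

build_dhcpcd_args "--timeout 0 --background " "fw"   # options stay separate
build_dhcpcd_args "--timeout 0 --background" "fw"    # "--background-h" is fused
```

In dhcpcd, --timeout 0 means waiting for a lease indefinitely and --background makes it fork to the background immediately, which is why the boot sequence is not held up.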