[v2,2/2] dns.cgi: Fixes bug#12395 - German umlauts not correctly displayed in remarks

Message ID 20240311121909.5445-2-adolf.belka@ipfire.org
State Staged
Commit 716b8fb503b5c02ac921cd843f8c646892c4cf55
Headers
Series [v2,1/2] dns.cgi: Revert "dns.cgi: Fixes bug#12395 - German umlauts not correctly displayed in remarks" |

Commit Message

Adolf Belka March 11, 2024, 12:19 p.m. UTC
  - If Freifunk München e.V. is entered as a remark it gets converted to
   Freifunk München e.V.
- This is because cleanhtml is used on the UTF-8 remark text before saving it to the file
   and the HTML::Entities::encode_entities command that is run on that remark text does
   not work with UTF-8 text.
- If the UTF-8 text in the remark is decoded before running through the cleanhtml command
   then the characters with diacritical marks are correctly shown.
- Have tested out the fix on a remark with a range of different characters with
   diacritical marks and all of the ones tested were displayed correctly with the fix while
   in the original form they were mangled.

Fixes: Bug#12395
Tested-by: Adolf Belka <adolf.belka@ipfire.org>
Signed-off-by: Adolf Belka <adolf.belka@ipfire.org>
---
 html/cgi-bin/dns.cgi | 7 +++++++
 1 file changed, 7 insertions(+)
  

Comments

Michael Tremer March 12, 2024, 10:02 a.m. UTC | #1
Thank you.

I merged this for now so that we can fix this problem quickly.

However I was wondering whether we should consider making the decode statement a part of the “cleanhtml” function.

I am still unsure why this is happening in the first place. We should be receiving UTF-8 from the browser, and I believe that perl doesn’t natively store things in UTF-8. That is however not a problem, because it should read files the same way it wrote them and so there should not be any difference when we re-read the configuration files. Unless some parts of the code specify any kind of encoding.

-Michael

> On 11 Mar 2024, at 12:19, Adolf Belka <adolf.belka@ipfire.org> wrote:
> 
> - If Freifunk München e.V. is entered as a remark it gets converted to
>   Freifunk München e.V.
> - This is because cleanhtml is used on the UTF-8 remark text before saving it to the file
>   and the HTML::Entities::encode_entities command that is run on that remark text does
>   not work with UTF-8 text.
> - If the UTF-8 text in the remark is decoded before running through the cleanhtml command
>   then the characters with diacritical marks are correctly shown.
> - Have tested out the fix on a remark with a range of different characters with
>   diacritical marks and all of the ones tested were displayed correctly with the fix while
>   in the original form they were mangled.
> 
> Fixes: Bug#12395
> Tested-by: Adolf Belka <adolf.belka@ipfire.org>
> Signed-off-by: Adolf Belka <adolf.belka@ipfire.org>
> ---
> html/cgi-bin/dns.cgi | 7 +++++++
> 1 file changed, 7 insertions(+)
> 
> diff --git a/html/cgi-bin/dns.cgi b/html/cgi-bin/dns.cgi
> index 0a34d3fd6..eb6f908d5 100644
> --- a/html/cgi-bin/dns.cgi
> +++ b/html/cgi-bin/dns.cgi
> @@ -142,6 +142,13 @@ if (($cgiparams{'SERVERS'} eq $Lang::tr{'save'}) || ($cgiparams{'SERVERS'} eq $L
> # Go further if there was no error.
> if ( ! $errormessage) {
> # Check if a remark has been entered.
> +
> + # decode the UTF-8 text so that characters with diacritical marks such as
> + # umlauts are treated correctly by the following cleanhtml command
> + $cgiparams{'REMARK'} = decode("UTF-8", $cgiparams{'REMARK'});
> +
> + # run the REMARK text through cleanhtml to ensure all unsafe html characters
> + # are correctly encoded to their html entities
> $cgiparams{'REMARK'} = &Header::cleanhtml($cgiparams{'REMARK'});
> 
> my %dns_servers = ();
> -- 
> 2.44.0
>
  
Adolf Belka March 12, 2024, 12:27 p.m. UTC | #2
Hi Michael,

On 12/03/2024 11:02, Michael Tremer wrote:
> Thank you.
> 
> I merged this for now so that we can fix this problem quickly.
> 
> However I was wondering whether we should consider making the decode statement a part of the “cleanhtml” function.
That makes a lot of sense. It would also mean that the problem of 
umlauts etc would be fixed everywhere that cleanhtml is used rather than 
needing to fix every invocation of cleanhtml.

I will look at putting something together for that.

> 
> I am still unsure why this is happening in the first place. We should be receiving UTF-8 from the browser, and I believe that perl doesn’t natively store things in UTF-8. That is however not a problem, because it should read files the same way it wrote them and so there should not be any difference when we re-read the configuration files. Unless some parts of the code specify any kind of encoding.
We do receive UTF-8 from the browser. The problem seems to be that the 
HTML::Entities::encode_entities command doesn't work with UTF-8 but with 
ISO-8859-1 encoding. I can't find where I found this the other day when 
I was searching on this topic to understand how to overcome it.

The fix is not encoding the text from the browser remark box into UTF-8 
but decoding it from UTF-8. Once the text is in the files then it is fine.

Of course my reasoning for doing the decoding may or may not be right, 
so I am always open to alternative suggestions.

Regards,

Adolf.
> 
> -Michael
> 
>> On 11 Mar 2024, at 12:19, Adolf Belka <adolf.belka@ipfire.org> wrote:
>>
>> - If Freifunk München e.V. is entered as a remark it gets converted to
>>    Freifunk München e.V.
>> - This is because cleanhtml is used on the UTF-8 remark text before saving it to the file
>>    and the HTML::Entities::encode_entities command that is run on that remark text does
>>    not work with UTF-8 text.
>> - If the UTF-8 text in the remark is decoded before running through the cleanhtml command
>>    then the characters with diacritical marks are correctly shown.
>> - Have tested out the fix on a remark with a range of different characters with
>>    diacritical marks and all of the ones tested were displayed correctly with the fix while
>>    in the original form they were mangled.
>>
>> Fixes: Bug#12395
>> Tested-by: Adolf Belka <adolf.belka@ipfire.org>
>> Signed-off-by: Adolf Belka <adolf.belka@ipfire.org>
>> ---
>> html/cgi-bin/dns.cgi | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/html/cgi-bin/dns.cgi b/html/cgi-bin/dns.cgi
>> index 0a34d3fd6..eb6f908d5 100644
>> --- a/html/cgi-bin/dns.cgi
>> +++ b/html/cgi-bin/dns.cgi
>> @@ -142,6 +142,13 @@ if (($cgiparams{'SERVERS'} eq $Lang::tr{'save'}) || ($cgiparams{'SERVERS'} eq $L
>> # Go further if there was no error.
>> if ( ! $errormessage) {
>> # Check if a remark has been entered.
>> +
>> + # decode the UTF-8 text so that characters with diacritical marks such as
>> + # umlauts are treated correctly by the following cleanhtml command
>> + $cgiparams{'REMARK'} = decode("UTF-8", $cgiparams{'REMARK'});
>> +
>> + # run the REMARK text through cleanhtml to ensure all unsafe html characters
>> + # are correctly encoded to their html entities
>> $cgiparams{'REMARK'} = &Header::cleanhtml($cgiparams{'REMARK'});
>>
>> my %dns_servers = ();
>> -- 
>> 2.44.0
>>
>
  
Michael Tremer March 12, 2024, 2:56 p.m. UTC | #3
Hello,

> On 12 Mar 2024, at 12:27, Adolf Belka <adolf.belka@ipfire.org> wrote:
> 
> Hi Michael,
> 
> On 12/03/2024 11:02, Michael Tremer wrote:
>> Thank you.
>> I merged this for now so that we can fix this problem quickly.
>> However I was wondering whether we should consider making the decode statement a part of the “cleanhtml” function.
> That makes a lot of sense. It would also mean that the problem of umlauts etc would be fixed everywhere that cleanhtml is used rather than needing to fix every invocation of cleanhtml.
> 
> I will look at putting something together for that.

This is a common problem with any kind of software. I suppose a lot of people are used to it and leave out any special characters where possible. I generally tend to comment in English because it’s shorter, doesn’t have the special character issue and allows customers to hire any staff where German isn’t their first language for example. For international customers I am using English anyways. Hence I never really ran into this problem.

>> I am still unsure why this is happening in the first place. We should be receiving UTF-8 from the browser, and I believe that perl doesn’t natively store things in UTF-8. That is however not a problem, because it should read files the same way it wrote them and so there should not be any difference when we re-read the configuration files. Unless some parts of the code specify any kind of encoding.
> We do receive UTF-8 from the browser. The problem seems to be that the HTML::Entities::encode_entities command doesn't work with UTF-8 but with ISO-8859-1 encoding. I can't find where I found this the other day when I was searching on this topic to understand how to overcome it.

Ah, yeah that would make sense. So basically we store it correctly but once we pass it through cleanhtml() we mess it up. Okay. That is good to know.

So we just need to encode/decode back to UTF-8.

> The fix is not encoding the text from the browser remark box into UTF-8 but decoding it from UTF-8. Once the text is in the files then it is fine.

We should always us UTF-8 everywhere. We might have some problems with Chinese or so (no idea really), but for 99% of our user-base UTF-8 should work fine.

> Of course my reasoning for doing the decoding may or may not be right, so I am always open to alternative suggestions.

I believe you found the right path :)

-Michael

> Regards,
> 
> Adolf.
>> -Michael
>>> On 11 Mar 2024, at 12:19, Adolf Belka <adolf.belka@ipfire.org> wrote:
>>> 
>>> - If Freifunk München e.V. is entered as a remark it gets converted to
>>>   Freifunk München e.V.
>>> - This is because cleanhtml is used on the UTF-8 remark text before saving it to the file
>>>   and the HTML::Entities::encode_entities command that is run on that remark text does
>>>   not work with UTF-8 text.
>>> - If the UTF-8 text in the remark is decoded before running through the cleanhtml command
>>>   then the characters with diacritical marks are correctly shown.
>>> - Have tested out the fix on a remark with a range of different characters with
>>>   diacritical marks and all of the ones tested were displayed correctly with the fix while
>>>   in the original form they were mangled.
>>> 
>>> Fixes: Bug#12395
>>> Tested-by: Adolf Belka <adolf.belka@ipfire.org>
>>> Signed-off-by: Adolf Belka <adolf.belka@ipfire.org>
>>> ---
>>> html/cgi-bin/dns.cgi | 7 +++++++
>>> 1 file changed, 7 insertions(+)
>>> 
>>> diff --git a/html/cgi-bin/dns.cgi b/html/cgi-bin/dns.cgi
>>> index 0a34d3fd6..eb6f908d5 100644
>>> --- a/html/cgi-bin/dns.cgi
>>> +++ b/html/cgi-bin/dns.cgi
>>> @@ -142,6 +142,13 @@ if (($cgiparams{'SERVERS'} eq $Lang::tr{'save'}) || ($cgiparams{'SERVERS'} eq $L
>>> # Go further if there was no error.
>>> if ( ! $errormessage) {
>>> # Check if a remark has been entered.
>>> +
>>> + # decode the UTF-8 text so that characters with diacritical marks such as
>>> + # umlauts are treated correctly by the following cleanhtml command
>>> + $cgiparams{'REMARK'} = decode("UTF-8", $cgiparams{'REMARK'});
>>> +
>>> + # run the REMARK text through cleanhtml to ensure all unsafe html characters
>>> + # are correctly encoded to their html entities
>>> $cgiparams{'REMARK'} = &Header::cleanhtml($cgiparams{'REMARK'});
>>> 
>>> my %dns_servers = ();
>>> -- 
>>> 2.44.0
>>> 
> 
> -- 
> Sent from my laptop
  
Adolf Belka March 13, 2024, 10:05 p.m. UTC | #4
Hi Michael,

On 12/03/2024 15:56, Michael Tremer wrote:
> Hello,
> 
>> On 12 Mar 2024, at 12:27, Adolf Belka <adolf.belka@ipfire.org> wrote:
>>
>> Hi Michael,
>>
>> On 12/03/2024 11:02, Michael Tremer wrote:
>>> Thank you.
>>> I merged this for now so that we can fix this problem quickly.
>>> However I was wondering whether we should consider making the decode statement a part of the “cleanhtml” function.
>> That makes a lot of sense. It would also mean that the problem of umlauts etc would be fixed everywhere that cleanhtml is used rather than needing to fix every invocation of cleanhtml.
>>
>> I will look at putting something together for that.
> 
> This is a common problem with any kind of software. I suppose a lot of people are used to it and leave out any special characters where possible. I generally tend to comment in English because it’s shorter, doesn’t have the special character issue and allows customers to hire any staff where German isn’t their first language for example. For international customers I am using English anyways. Hence I never really ran into this problem.
> 
>>> I am still unsure why this is happening in the first place. We should be receiving UTF-8 from the browser, and I believe that perl doesn’t natively store things in UTF-8. That is however not a problem, because it should read files the same way it wrote them and so there should not be any difference when we re-read the configuration files. Unless some parts of the code specify any kind of encoding.
>> We do receive UTF-8 from the browser. The problem seems to be that the HTML::Entities::encode_entities command doesn't work with UTF-8 but with ISO-8859-1 encoding. I can't find where I found this the other day when I was searching on this topic to understand how to overcome it.
> 
> Ah, yeah that would make sense. So basically we store it correctly but once we pass it through cleanhtml() we mess it up. Okay. That is good to know.
> 
> So we just need to encode/decode back to UTF-8.
Two things with the previous patch I did.

Firstly I forgot to add

use Encode

at the top of the dns.cgi file. I had it in the vm test system but forgot to copy it across to the patch I created.

Secondly, I decode  the UTF-8 so cleanhtml works correctly with the text but I didn't then encode it back to UTF-8 after doing the html::entities cleaning as you have indicated would be the right approach.

I tested that today in my vm testbed and I confirm it works correctly, keeping the text with the correct html entity values but also then ensures that the text is back in UTF-8 format.

As the patch set from this original email has been merged into next I will submit a separate patch to add the use Encode and the encode back to UTF-8 lines for adding to next.

Regards,

Adolf.
> 
>> The fix is not encoding the text from the browser remark box into UTF-8 but decoding it from UTF-8. Once the text is in the files then it is fine.
> 
> We should always us UTF-8 everywhere. We might have some problems with Chinese or so (no idea really), but for 99% of our user-base UTF-8 should work fine.
> 
>> Of course my reasoning for doing the decoding may or may not be right, so I am always open to alternative suggestions.
> 
> I believe you found the right path :)
> 
> -Michael
> 
>> Regards,
>>
>> Adolf.
>>> -Michael
>>>> On 11 Mar 2024, at 12:19, Adolf Belka <adolf.belka@ipfire.org> wrote:
>>>>
>>>> - If Freifunk München e.V. is entered as a remark it gets converted to
>>>>    Freifunk München e.V.
>>>> - This is because cleanhtml is used on the UTF-8 remark text before saving it to the file
>>>>    and the HTML::Entities::encode_entities command that is run on that remark text does
>>>>    not work with UTF-8 text.
>>>> - If the UTF-8 text in the remark is decoded before running through the cleanhtml command
>>>>    then the characters with diacritical marks are correctly shown.
>>>> - Have tested out the fix on a remark with a range of different characters with
>>>>    diacritical marks and all of the ones tested were displayed correctly with the fix while
>>>>    in the original form they were mangled.
>>>>
>>>> Fixes: Bug#12395
>>>> Tested-by: Adolf Belka <adolf.belka@ipfire.org>
>>>> Signed-off-by: Adolf Belka <adolf.belka@ipfire.org>
>>>> ---
>>>> html/cgi-bin/dns.cgi | 7 +++++++
>>>> 1 file changed, 7 insertions(+)
>>>>
>>>> diff --git a/html/cgi-bin/dns.cgi b/html/cgi-bin/dns.cgi
>>>> index 0a34d3fd6..eb6f908d5 100644
>>>> --- a/html/cgi-bin/dns.cgi
>>>> +++ b/html/cgi-bin/dns.cgi
>>>> @@ -142,6 +142,13 @@ if (($cgiparams{'SERVERS'} eq $Lang::tr{'save'}) || ($cgiparams{'SERVERS'} eq $L
>>>> # Go further if there was no error.
>>>> if ( ! $errormessage) {
>>>> # Check if a remark has been entered.
>>>> +
>>>> + # decode the UTF-8 text so that characters with diacritical marks such as
>>>> + # umlauts are treated correctly by the following cleanhtml command
>>>> + $cgiparams{'REMARK'} = decode("UTF-8", $cgiparams{'REMARK'});
>>>> +
>>>> + # run the REMARK text through cleanhtml to ensure all unsafe html characters
>>>> + # are correctly encoded to their html entities
>>>> $cgiparams{'REMARK'} = &Header::cleanhtml($cgiparams{'REMARK'});
>>>>
>>>> my %dns_servers = ();
>>>> -- 
>>>> 2.44.0
>>>>
>>
>> -- 
>> Sent from my laptop
> 
>
  
Michael Tremer March 15, 2024, 10:53 a.m. UTC | #5
Hello,

> On 13 Mar 2024, at 22:05, Adolf Belka <adolf.belka@ipfire.org> wrote:
> 
> Hi Michael,
> 
> On 12/03/2024 15:56, Michael Tremer wrote:
>> Hello,
>>> On 12 Mar 2024, at 12:27, Adolf Belka <adolf.belka@ipfire.org> wrote:
>>> 
>>> Hi Michael,
>>> 
>>> On 12/03/2024 11:02, Michael Tremer wrote:
>>>> Thank you.
>>>> I merged this for now so that we can fix this problem quickly.
>>>> However I was wondering whether we should consider making the decode statement a part of the “cleanhtml” function.
>>> That makes a lot of sense. It would also mean that the problem of umlauts etc would be fixed everywhere that cleanhtml is used rather than needing to fix every invocation of cleanhtml.
>>> 
>>> I will look at putting something together for that.
>> This is a common problem with any kind of software. I suppose a lot of people are used to it and leave out any special characters where possible. I generally tend to comment in English because it’s shorter, doesn’t have the special character issue and allows customers to hire any staff where German isn’t their first language for example. For international customers I am using English anyways. Hence I never really ran into this problem.
>>>> I am still unsure why this is happening in the first place. We should be receiving UTF-8 from the browser, and I believe that perl doesn’t natively store things in UTF-8. That is however not a problem, because it should read files the same way it wrote them and so there should not be any difference when we re-read the configuration files. Unless some parts of the code specify any kind of encoding.
>>> We do receive UTF-8 from the browser. The problem seems to be that the HTML::Entities::encode_entities command doesn't work with UTF-8 but with ISO-8859-1 encoding. I can't find where I found this the other day when I was searching on this topic to understand how to overcome it.
>> Ah, yeah that would make sense. So basically we store it correctly but once we pass it through cleanhtml() we mess it up. Okay. That is good to know.
>> So we just need to encode/decode back to UTF-8.
> Two things with the previous patch I did.
> 
> Firstly I forgot to add
> 
> use Encode

I try to always explicitly name the function that I call from a module. Like &Encode::encode(…) instead of just “encode()”. I am not sure whether that changes anything else but my feeling that this function belongs to a certain module.

> at the top of the dns.cgi file. I had it in the vm test system but forgot to copy it across to the patch I created.
> 
> Secondly, I decode  the UTF-8 so cleanhtml works correctly with the text but I didn't then encode it back to UTF-8 after doing the html::entities cleaning as you have indicated would be the right approach.
> 
> I tested that today in my vm testbed and I confirm it works correctly, keeping the text with the correct html entity values but also then ensures that the text is back in UTF-8 format.
> 
> As the patch set from this original email has been merged into next I will submit a separate patch to add the use Encode and the encode back to UTF-8 lines for adding to next.
> 
> Regards,
> 
> Adolf.
>>> The fix is not encoding the text from the browser remark box into UTF-8 but decoding it from UTF-8. Once the text is in the files then it is fine.
>> We should always us UTF-8 everywhere. We might have some problems with Chinese or so (no idea really), but for 99% of our user-base UTF-8 should work fine.
>>> Of course my reasoning for doing the decoding may or may not be right, so I am always open to alternative suggestions.
>> I believe you found the right path :)
>> -Michael
>>> Regards,
>>> 
>>> Adolf.
>>>> -Michael
>>>>> On 11 Mar 2024, at 12:19, Adolf Belka <adolf.belka@ipfire.org> wrote:
>>>>> 
>>>>> - If Freifunk München e.V. is entered as a remark it gets converted to
>>>>>   Freifunk München e.V.
>>>>> - This is because cleanhtml is used on the UTF-8 remark text before saving it to the file
>>>>>   and the HTML::Entities::encode_entities command that is run on that remark text does
>>>>>   not work with UTF-8 text.
>>>>> - If the UTF-8 text in the remark is decoded before running through the cleanhtml command
>>>>>   then the characters with diacritical marks are correctly shown.
>>>>> - Have tested out the fix on a remark with a range of different characters with
>>>>>   diacritical marks and all of the ones tested were displayed correctly with the fix while
>>>>>   in the original form they were mangled.
>>>>> 
>>>>> Fixes: Bug#12395
>>>>> Tested-by: Adolf Belka <adolf.belka@ipfire.org>
>>>>> Signed-off-by: Adolf Belka <adolf.belka@ipfire.org>
>>>>> ---
>>>>> html/cgi-bin/dns.cgi | 7 +++++++
>>>>> 1 file changed, 7 insertions(+)
>>>>> 
>>>>> diff --git a/html/cgi-bin/dns.cgi b/html/cgi-bin/dns.cgi
>>>>> index 0a34d3fd6..eb6f908d5 100644
>>>>> --- a/html/cgi-bin/dns.cgi
>>>>> +++ b/html/cgi-bin/dns.cgi
>>>>> @@ -142,6 +142,13 @@ if (($cgiparams{'SERVERS'} eq $Lang::tr{'save'}) || ($cgiparams{'SERVERS'} eq $L
>>>>> # Go further if there was no error.
>>>>> if ( ! $errormessage) {
>>>>> # Check if a remark has been entered.
>>>>> +
>>>>> + # decode the UTF-8 text so that characters with diacritical marks such as
>>>>> + # umlauts are treated correctly by the following cleanhtml command
>>>>> + $cgiparams{'REMARK'} = decode("UTF-8", $cgiparams{'REMARK'});
>>>>> +
>>>>> + # run the REMARK text through cleanhtml to ensure all unsafe html characters
>>>>> + # are correctly encoded to their html entities
>>>>> $cgiparams{'REMARK'} = &Header::cleanhtml($cgiparams{'REMARK'});
>>>>> 
>>>>> my %dns_servers = ();
>>>>> -- 
>>>>> 2.44.0
>>>>> 
>>> 
>>> -- 
>>> Sent from my laptop
  

Patch

diff --git a/html/cgi-bin/dns.cgi b/html/cgi-bin/dns.cgi
index 0a34d3fd6..eb6f908d5 100644
--- a/html/cgi-bin/dns.cgi
+++ b/html/cgi-bin/dns.cgi
@@ -142,6 +142,13 @@  if (($cgiparams{'SERVERS'} eq $Lang::tr{'save'}) || ($cgiparams{'SERVERS'} eq $L
 	# Go further if there was no error.
 	if ( ! $errormessage) {
 		# Check if a remark has been entered.
+
+		# decode the UTF-8 text so that characters with diacritical marks such as
+		# umlauts are treated correctly by the following cleanhtml command
+		$cgiparams{'REMARK'} = decode("UTF-8", $cgiparams{'REMARK'});
+
+		# run the REMARK text through cleanhtml to ensure all unsafe html characters
+		# are correctly encoded to their html entities
 		$cgiparams{'REMARK'} = &Header::cleanhtml($cgiparams{'REMARK'});
 
 		my %dns_servers = ();