location-importer.in: Import (technical) AS names from ARIN

Message ID 20210608121036.16242-1-peter.mueller@ipfire.org
State Superseded
Headers
Series location-importer.in: Import (technical) AS names from ARIN |

Commit Message

Peter Müller June 8, 2021, 12:10 p.m. UTC
  ARIN and LACNIC, unfortunately, do not seem to publish data containing
human readable AS names. For the former, we at least have a list of
tecnical names, which this patch fetches and inserts into the autnums
table.

While some of them do not seem to be suitable for human consumption (i.
e. being very cryptic), providing these data might be helpful
neverthelesss.

Signed-off-by: Peter Müller <peter.mueller@ipfire.org>
---
 src/python/location-importer.in | 61 +++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)
  

Comments

Michael Tremer June 8, 2021, 2:40 p.m. UTC | #1
Hello,

Good code. I am sure it works.

I just want to use this a little bit of a Python exercise because I am quite particular about my Python :)

> On 8 Jun 2021, at 13:10, Peter Müller <peter.mueller@ipfire.org> wrote:
> 
> ARIN and LACNIC, unfortunately, do not seem to publish data containing
> human readable AS names. For the former, we at least have a list of
> tecnical names, which this patch fetches and inserts into the autnums
> table.
> 
> While some of them do not seem to be suitable for human consumption (i.
> e. being very cryptic), providing these data might be helpful
> neverthelesss.
> 
> Signed-off-by: Peter Müller <peter.mueller@ipfire.org>
> ---
> src/python/location-importer.in | 61 +++++++++++++++++++++++++++++++++
> 1 file changed, 61 insertions(+)
> 
> diff --git a/src/python/location-importer.in b/src/python/location-importer.in
> index aa3b8f7..2a9bf33 100644
> --- a/src/python/location-importer.in
> +++ b/src/python/location-importer.in
> @@ -505,6 +505,9 @@ class CLI(object):
> 						for line in f:
> 							self._parse_line(line, source_key, validcountries)
> 
> +		# Download and import (technical) AS names from ARIN
> +		self._import_as_names_from_arin()
> +
> 	def _check_parsed_network(self, network):
> 		"""
> 			Assistive function to detect and subsequently sort out parsed
> @@ -775,6 +778,64 @@ class CLI(object):
> 			"%s" % network, country, [country], source_key,
> 		)
> 
> +	def _import_as_names_from_arin(self):
> +		downloader = location.importer.Downloader()
> +
> +		# XXX: Download AS names file from ARIN (note that these names appear to be quite
> +		# technical, not intended for human consumption, as description fields in
> +		# organisation handles for other RIRs are - however, this is what we have got,
> +		# and in some cases, it might be still better than nothing)
> +		try:
> +			with downloader.request("https://ftp.arin.net/info/asn.txt", return_blocks=False) as f:
> +				arin_as_names_file = f.body

You copy the entire content into memory. That is something you can do, but the whole point of the download handler was so that a large file can be read line by line… In this case memory is probably not a concern, but generally I do not like aliasing something.

> +		except Exception as e:
> +			log.error("failed to download and preprocess AS name file from ARIN: %s" % e)
> +			return

Do we just want to continue if this fails?

> +
> +		# Split downloaded body into lines and parse each of them...
> +		for sline in arin_as_names_file.readlines():

Iterating over f (above) would make the readlines() call unnecessary. It creates a list with the lines of the file. If memory was a concern, this is quite inefficient. Reading one line at a time would be better.

> +
> +			# ... valid lines start with a space, followed by the number of the Autonomous System ...
> +			if not sline.startswith(b" "):
> +				continue

I usually convert bytes into strings straight away (unless I can perform some cheap checks before).

> +
> +			# Split line and check if there is a valid ASN in it...
> +			scontents = sline.split()
> +			try:
> +				asn = int(scontents[0])
> +			except ValueError:
> +				log.debug("Skipping ARIN AS names line not containing an integer for ASN")
> +				continue

Here is where it is getting interesting. You are using lots and lots of lines to do something very simple: Split a string in two parts.

You can either do this:

  asn, space, name = line.partition(“ “)

Or you can use split with a maximum:

  asn, name = line.split(“ “, 1)

That avoids the whole aliasing thing with scontent[0] later. The variable has a good name straight away and you will never need to change any array indices, because that is messy and a waste of time.

I also like brief variable names. as_name should be name because context makes it clear that it is the name of the AS, doesn’t it?

> +
> +			if not ((1 <= asn and asn <= 23455) or (23457 <= asn and asn <= 64495) or (131072 <= asn and asn <= 4199999999)):
> +				log.debug("Skipping ARIN AS names line not containing a valid ASN: %s" % asn)
> +				continue

I believe we are using this in other places, so it could become a function that can be reused.

> +
> +			# Skip any AS name that appears to be a placeholder for a different RIR or entity...
> +			as_name = scontents[1].decode("ascii")

This could fail and you would throw an exception.

> +			if re.match(r"^(ASN-BLK|)(AFCONC|AFRINIC|APNIC|ASNBLK|DNIC|LACNIC|RIPE|IANA)\d{0,1}-*", as_name):
> +				continue

This is a good solution. To make the regular expression faster, you could end after the \d+. You are checking if there is no or many dashes after it. I am sure if that is what you intended or if that is an attempt to use globbing.

> +			# Bail out in case the AS name contains anything we do not expect here...
> +			if re.search(r"[^a-zA-Z0-9-_]", as_name):
> +				log.debug("Skipping ARIN AS name for %s containing invalid characters: %s" % \
> +						(asn, as_name))

Hmm. What are you trying to do here? Find non-printable characters? What about special characters like é, ä, ñ and so on?

> +
> +			# Things look good here, run INSERT statement and skip this one if we already have
> +			# a (better?) name for this Autonomous System...
> +			self.db.execute("""
> +				INSERT INTO autnums(
> +					number,
> +					name,
> +					source
> +				) VALUES (%s, %s, %s)
> +				ON CONFLICT (number) DO NOTHING""",
> +				asn,
> +				as_name,
> +				"ARIN",
> +			)
> +
> 	def handle_update_announcements(self, ns):
> 		server = ns.server[0]
> 
> -- 
> 2.20.1
> 

This is just my attempt to make you more aware about some things in the code that might not be very efficient (when it might not matter that much), and what my personal “style” is. Others might disagree, but in general my code performs very well and is very readable. I hope :)

-Michael
  
Peter Müller June 8, 2021, 3:10 p.m. UTC | #2
Hello Michael,

thanks for your reply.

> Hello,
> 
> Good code. I am sure it works.

Well, I tested it on location02, and would be a bit puzzled if it does not work somewhere else...

> I just want to use this a little bit of a Python exercise because I am quite particular about my Python :)
> 
>> On 8 Jun 2021, at 13:10, Peter Müller <peter.mueller@ipfire.org> wrote:
>>
>> ARIN and LACNIC, unfortunately, do not seem to publish data containing
>> human readable AS names. For the former, we at least have a list of
>> tecnical names, which this patch fetches and inserts into the autnums
>> table.
>>
>> While some of them do not seem to be suitable for human consumption (i.
>> e. being very cryptic), providing these data might be helpful
>> neverthelesss.
>>
>> Signed-off-by: Peter Müller <peter.mueller@ipfire.org>
>> ---
>> src/python/location-importer.in | 61 +++++++++++++++++++++++++++++++++
>> 1 file changed, 61 insertions(+)
>>
>> diff --git a/src/python/location-importer.in b/src/python/location-importer.in
>> index aa3b8f7..2a9bf33 100644
>> --- a/src/python/location-importer.in
>> +++ b/src/python/location-importer.in
>> @@ -505,6 +505,9 @@ class CLI(object):
>> 						for line in f:
>> 							self._parse_line(line, source_key, validcountries)
>>
>> +		# Download and import (technical) AS names from ARIN
>> +		self._import_as_names_from_arin()
>> +
>> 	def _check_parsed_network(self, network):
>> 		"""
>> 			Assistive function to detect and subsequently sort out parsed
>> @@ -775,6 +778,64 @@ class CLI(object):
>> 			"%s" % network, country, [country], source_key,
>> 		)
>>
>> +	def _import_as_names_from_arin(self):
>> +		downloader = location.importer.Downloader()
>> +
>> +		# XXX: Download AS names file from ARIN (note that these names appear to be quite
>> +		# technical, not intended for human consumption, as description fields in
>> +		# organisation handles for other RIRs are - however, this is what we have got,
>> +		# and in some cases, it might be still better than nothing)
>> +		try:
>> +			with downloader.request("https://ftp.arin.net/info/asn.txt", return_blocks=False) as f:
>> +				arin_as_names_file = f.body
> 
> You copy the entire content into memory. That is something you can do, but the whole point of the download handler was so that a large file can be read line by line… In this case memory is probably not a concern, but generally I do not like aliasing something.

ACK.

> 
>> +		except Exception as e:
>> +			log.error("failed to download and preprocess AS name file from ARIN: %s" % e)
>> +			return
> 
> Do we just want to continue if this fails?

What for? If we cannot download the asn.txt file from ARIN, there is nothing left to do in this function.

> 
>> +
>> +		# Split downloaded body into lines and parse each of them...
>> +		for sline in arin_as_names_file.readlines():
> 
> Iterating over f (above) would make the readlines() call unnecessary. It creates a list with the lines of the file. If memory was a concern, this is quite inefficient. Reading one line at a time would be better.

ACK.

> 
>> +
>> +			# ... valid lines start with a space, followed by the number of the Autonomous System ...
>> +			if not sline.startswith(b" "):
>> +				continue
> 
> I usually convert bytes into strings straight away (unless I can perform some cheap checks before).

This is just a personal habit of mine: Unless I can reasonably assume something binary looks like valid input,
I never peek too much into it. However, str() should be safe to use in Python, so I guess your approach is better.

> 
>> +
>> +			# Split line and check if there is a valid ASN in it...
>> +			scontents = sline.split()
>> +			try:
>> +				asn = int(scontents[0])
>> +			except ValueError:
>> +				log.debug("Skipping ARIN AS names line not containing an integer for ASN")
>> +				continue
> 
> Here is where it is getting interesting. You are using lots and lots of lines to do something very simple: Split a string in two parts.
> 
> You can either do this:
> 
>   asn, space, name = line.partition(“ “)
> 
> Or you can use split with a maximum:
> 
>   asn, name = line.split(“ “, 1)
> 
> That avoids the whole aliasing thing with scontent[0] later. The variable has a good name straight away and you will never need to change any array indices, because that is messy and a waste of time.

ACK.

> 
> I also like brief variable names. as_name should be name because context makes it clear that it is the name of the AS, doesn’t it?

ACK.

> 
>> +
>> +			if not ((1 <= asn and asn <= 23455) or (23457 <= asn and asn <= 64495) or (131072 <= asn and asn <= 4199999999)):
>> +				log.debug("Skipping ARIN AS names line not containing a valid ASN: %s" % asn)
>> +				continue
> 
> I believe we are using this in other places, so it could become a function that can be reused.

Yes, we have that for updating the BGP feed, but placed the checks inside a SQL statement there, so I did not came up
with a good idea on how to recycle that. Is it fine to you if I just leave this as it is for the time being?

> 
>> +
>> +			# Skip any AS name that appears to be a placeholder for a different RIR or entity...
>> +			as_name = scontents[1].decode("ascii")
> 
> This could fail and you would throw an exception.

Yes, that slipped my mind...

> 
>> +			if re.match(r"^(ASN-BLK|)(AFCONC|AFRINIC|APNIC|ASNBLK|DNIC|LACNIC|RIPE|IANA)\d{0,1}-*", as_name):
>> +				continue
> 
> This is a good solution. To make the regular expression faster, you could end after the \d+. You are checking if there is no or many dashes after it. I am sure if that is what you intended or if that is an attempt to use globbing.

Not exactly. The hyphen after one or no digit is crucial. But it could be something like \d? (PCRE thinking)...

> 
>> +			# Bail out in case the AS name contains anything we do not expect here...
>> +			if re.search(r"[^a-zA-Z0-9-_]", as_name):
>> +				log.debug("Skipping ARIN AS name for %s containing invalid characters: %s" % \
>> +						(asn, as_name))
> 
> Hmm. What are you trying to do here? Find non-printable characters? What about special characters like é, ä, ñ and so on?

They should not appear in this file. AS names consist of ASCII letters, digits, and hyphens only.

> 
>> +
>> +			# Things look good here, run INSERT statement and skip this one if we already have
>> +			# a (better?) name for this Autonomous System...
>> +			self.db.execute("""
>> +				INSERT INTO autnums(
>> +					number,
>> +					name,
>> +					source
>> +				) VALUES (%s, %s, %s)
>> +				ON CONFLICT (number) DO NOTHING""",
>> +				asn,
>> +				as_name,
>> +				"ARIN",
>> +			)
>> +
>> 	def handle_update_announcements(self, ns):
>> 		server = ns.server[0]
>>
>> -- 
>> 2.20.1
>>
> 
> This is just my attempt to make you more aware about some things in the code that might not be very efficient (when it might not matter that much), and what my personal “style” is. Others might disagree, but in general my code performs very well and is very readable. I hope :)

Thank you. I will hand in a second version of this patch...

Thanks, and best regards,
Peter Müller
  
Michael Tremer June 8, 2021, 3:15 p.m. UTC | #3
Hello,

> On 8 Jun 2021, at 16:10, Peter Müller <peter.mueller@ipfire.org> wrote:
> 
> Hello Michael,
> 
> thanks for your reply.
> 
>> Hello,
>> 
>> Good code. I am sure it works.
> 
> Well, I tested it on location02, and would be a bit puzzled if it does not work somewhere else...
> 

Exactly what I expected :)

>> I just want to use this a little bit of a Python exercise because I am quite particular about my Python :)
>> 
>>> On 8 Jun 2021, at 13:10, Peter Müller <peter.mueller@ipfire.org> wrote:
>>> 
>>> ARIN and LACNIC, unfortunately, do not seem to publish data containing
>>> human readable AS names. For the former, we at least have a list of
>>> tecnical names, which this patch fetches and inserts into the autnums
>>> table.
>>> 
>>> While some of them do not seem to be suitable for human consumption (i.
>>> e. being very cryptic), providing these data might be helpful
>>> neverthelesss.
>>> 
>>> Signed-off-by: Peter Müller <peter.mueller@ipfire.org>
>>> ---
>>> src/python/location-importer.in | 61 +++++++++++++++++++++++++++++++++
>>> 1 file changed, 61 insertions(+)
>>> 
>>> diff --git a/src/python/location-importer.in b/src/python/location-importer.in
>>> index aa3b8f7..2a9bf33 100644
>>> --- a/src/python/location-importer.in
>>> +++ b/src/python/location-importer.in
>>> @@ -505,6 +505,9 @@ class CLI(object):
>>> 						for line in f:
>>> 							self._parse_line(line, source_key, validcountries)
>>> 
>>> +		# Download and import (technical) AS names from ARIN
>>> +		self._import_as_names_from_arin()
>>> +
>>> 	def _check_parsed_network(self, network):
>>> 		"""
>>> 			Assistive function to detect and subsequently sort out parsed
>>> @@ -775,6 +778,64 @@ class CLI(object):
>>> 			"%s" % network, country, [country], source_key,
>>> 		)
>>> 
>>> +	def _import_as_names_from_arin(self):
>>> +		downloader = location.importer.Downloader()
>>> +
>>> +		# XXX: Download AS names file from ARIN (note that these names appear to be quite
>>> +		# technical, not intended for human consumption, as description fields in
>>> +		# organisation handles for other RIRs are - however, this is what we have got,
>>> +		# and in some cases, it might be still better than nothing)
>>> +		try:
>>> +			with downloader.request("https://ftp.arin.net/info/asn.txt", return_blocks=False) as f:
>>> +				arin_as_names_file = f.body
>> 
>> You copy the entire content into memory. That is something you can do, but the whole point of the download handler was so that a large file can be read line by line… In this case memory is probably not a concern, but generally I do not like aliasing something.
> 
> ACK.
> 
>> 
>>> +		except Exception as e:
>>> +			log.error("failed to download and preprocess AS name file from ARIN: %s" % e)
>>> +			return
>> 
>> Do we just want to continue if this fails?
> 
> What for? If we cannot download the asn.txt file from ARIN, there is nothing left to do in this function.

True, but should we not raise an error instead of silently logging this?

>> 
>>> +
>>> +		# Split downloaded body into lines and parse each of them...
>>> +		for sline in arin_as_names_file.readlines():
>> 
>> Iterating over f (above) would make the readlines() call unnecessary. It creates a list with the lines of the file. If memory was a concern, this is quite inefficient. Reading one line at a time would be better.
> 
> ACK.
> 
>> 
>>> +
>>> +			# ... valid lines start with a space, followed by the number of the Autonomous System ...
>>> +			if not sline.startswith(b" "):
>>> +				continue
>> 
>> I usually convert bytes into strings straight away (unless I can perform some cheap checks before).
> 
> This is just a personal habit of mine: Unless I can reasonably assume something binary looks like valid input,
> I never peek too much into it. However, str() should be safe to use in Python, so I guess your approach is better.

Hmm, I suppose we can call Python memory-safe. Since it is using glibc’s string functions underneath, it would have less overhead and take advantage of AVX, SSE, etc. and therefore have less overhead to only call it once per line.

Even better would be to only call it once per file, but then you might run into memory allocation issues again.

> 
>> 
>>> +
>>> +			# Split line and check if there is a valid ASN in it...
>>> +			scontents = sline.split()
>>> +			try:
>>> +				asn = int(scontents[0])
>>> +			except ValueError:
>>> +				log.debug("Skipping ARIN AS names line not containing an integer for ASN")
>>> +				continue
>> 
>> Here is where it is getting interesting. You are using lots and lots of lines to do something very simple: Split a string in two parts.
>> 
>> You can either do this:
>> 
>>  asn, space, name = line.partition(“ “)
>> 
>> Or you can use split with a maximum:
>> 
>>  asn, name = line.split(“ “, 1)
>> 
>> That avoids the whole aliasing thing with scontent[0] later. The variable has a good name straight away and you will never need to change any array indices, because that is messy and a waste of time.
> 
> ACK.
> 
>> 
>> I also like brief variable names. as_name should be name because context makes it clear that it is the name of the AS, doesn’t it?
> 
> ACK.
> 
>> 
>>> +
>>> +			if not ((1 <= asn and asn <= 23455) or (23457 <= asn and asn <= 64495) or (131072 <= asn and asn <= 4199999999)):
>>> +				log.debug("Skipping ARIN AS names line not containing a valid ASN: %s" % asn)
>>> +				continue
>> 
>> I believe we are using this in other places, so it could become a function that can be reused.
> 
> Yes, we have that for updating the BGP feed, but placed the checks inside a SQL statement there, so I did not came up
> with a good idea on how to recycle that. Is it fine to you if I just leave this as it is for the time being?

I am okay with this.

> 
>> 
>>> +
>>> +			# Skip any AS name that appears to be a placeholder for a different RIR or entity...
>>> +			as_name = scontents[1].decode("ascii")
>> 
>> This could fail and you would throw an exception.
> 
> Yes, that slipped my mind...
> 
>> 
>>> +			if re.match(r"^(ASN-BLK|)(AFCONC|AFRINIC|APNIC|ASNBLK|DNIC|LACNIC|RIPE|IANA)\d{0,1}-*", as_name):
>>> +				continue
>> 
>> This is a good solution. To make the regular expression faster, you could end after the \d+. You are checking if there is no or many dashes after it. I am sure if that is what you intended or if that is an attempt to use globbing.
> 
> Not exactly. The hyphen after one or no digit is crucial. But it could be something like \d? (PCRE thinking)...

You are checking for none to many. So it technically does not matter what comes after \d+

You probably want this:

  ^(ASN-BLK|)(AFCONC|AFRINIC|APNIC|ASNBLK|DNIC|LACNIC|RIPE|IANA)\d\-+

> 
>> 
>>> +			# Bail out in case the AS name contains anything we do not expect here...
>>> +			if re.search(r"[^a-zA-Z0-9-_]", as_name):
>>> +				log.debug("Skipping ARIN AS name for %s containing invalid characters: %s" % \
>>> +						(asn, as_name))
>> 
>> Hmm. What are you trying to do here? Find non-printable characters? What about special characters like é, ä, ñ and so on?
> 
> They should not appear in this file. AS names consist of ASCII letters, digits, and hyphens only.

Okay. Let’s hope that people will follow this.

>> 
>>> +
>>> +			# Things look good here, run INSERT statement and skip this one if we already have
>>> +			# a (better?) name for this Autonomous System...
>>> +			self.db.execute("""
>>> +				INSERT INTO autnums(
>>> +					number,
>>> +					name,
>>> +					source
>>> +				) VALUES (%s, %s, %s)
>>> +				ON CONFLICT (number) DO NOTHING""",
>>> +				asn,
>>> +				as_name,
>>> +				"ARIN",
>>> +			)
>>> +
>>> 	def handle_update_announcements(self, ns):
>>> 		server = ns.server[0]
>>> 
>>> -- 
>>> 2.20.1
>>> 
>> 
>> This is just my attempt to make you more aware about some things in the code that might not be very efficient (when it might not matter that much), and what my personal “style” is. Others might disagree, but in general my code performs very well and is very readable. I hope :)
> 
> Thank you. I will hand in a second version of this patch...

Up to you if you want to :)

-Michael

> 
> Thanks, and best regards,
> Peter Müller
  

Patch

diff --git a/src/python/location-importer.in b/src/python/location-importer.in
index aa3b8f7..2a9bf33 100644
--- a/src/python/location-importer.in
+++ b/src/python/location-importer.in
@@ -505,6 +505,9 @@  class CLI(object):
 						for line in f:
 							self._parse_line(line, source_key, validcountries)
 
+		# Download and import (technical) AS names from ARIN
+		self._import_as_names_from_arin()
+
 	def _check_parsed_network(self, network):
 		"""
 			Assistive function to detect and subsequently sort out parsed
@@ -775,6 +778,64 @@  class CLI(object):
 			"%s" % network, country, [country], source_key,
 		)
 
+	def _import_as_names_from_arin(self):
+		downloader = location.importer.Downloader()
+
+		# XXX: Download AS names file from ARIN (note that these names appear to be quite
+		# technical, not intended for human consumption, as description fields in
+		# organisation handles for other RIRs are - however, this is what we have got,
+		# and in some cases, it might be still better than nothing)
+		try:
+			with downloader.request("https://ftp.arin.net/info/asn.txt", return_blocks=False) as f:
+				arin_as_names_file = f.body
+		except Exception as e:
+			log.error("failed to download and preprocess AS name file from ARIN: %s" % e)
+			return
+
+		# Split downloaded body into lines and parse each of them...
+		for sline in arin_as_names_file.readlines():
+
+			# ... valid lines start with a space, followed by the number of the Autonomous System ...
+			if not sline.startswith(b" "):
+				continue
+
+			# Split line and check if there is a valid ASN in it...
+			scontents = sline.split()
+			try:
+				asn = int(scontents[0])
+			except ValueError:
+				log.debug("Skipping ARIN AS names line not containing an integer for ASN")
+				continue
+
+			if not ((1 <= asn and asn <= 23455) or (23457 <= asn and asn <= 64495) or (131072 <= asn and asn <= 4199999999)):
+				log.debug("Skipping ARIN AS names line not containing a valid ASN: %s" % asn)
+				continue
+
+			# Skip any AS name that appears to be a placeholder for a different RIR or entity...
+			as_name = scontents[1].decode("ascii")
+
+			if re.match(r"^(ASN-BLK|)(AFCONC|AFRINIC|APNIC|ASNBLK|DNIC|LACNIC|RIPE|IANA)\d{0,1}-*", as_name):
+				continue
+
+			# Bail out in case the AS name contains anything we do not expect here...
+			if re.search(r"[^a-zA-Z0-9-_]", as_name):
+				log.debug("Skipping ARIN AS name for %s containing invalid characters: %s" % \
+						(asn, as_name))
+
+			# Things look good here, run INSERT statement and skip this one if we already have
+			# a (better?) name for this Autonomous System...
+			self.db.execute("""
+				INSERT INTO autnums(
+					number,
+					name,
+					source
+				) VALUES (%s, %s, %s)
+				ON CONFLICT (number) DO NOTHING""",
+				asn,
+				as_name,
+				"ARIN",
+			)
+
 	def handle_update_announcements(self, ns):
 		server = ns.server[0]