Web scraping in Python with absolute URLs.
How to scrape web links from a website using the U.S. Census.gov website as an example.
First I imported the BeautifulSoup library, which builds a parse tree for HTML and XML text, and used the "html.parser" associated with BeautifulSoup. Then I had to retrieve the page from census.gov, first by creating a variable called "url" and setting it equal to the exact URL indicated in the assessment, "https://www.census.gov/programs-surveys/popest.html". Then I created another variable called "r" and set it equal to the result of the "get()" function from the imported "requests" library. The "url" variable was nested inside the get() function; this line of code fetched the HTML of the website.
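Below is a minimal sketch of this fetch step (assuming the requests library is installed; the status-code check is my addition for illustration):
import requests

url = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(url)  # fetch the raw HTML of the page
print(r.status_code)  # 200 would indicate the page was retrieved successfully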
Then I created a variable "soup" by calling the BeautifulSoup() function, which took the argument "r" to isolate the HTML content, nested along with the "html.parser" command; this line of code initialized the parse tree over the HTML text (crummy.com). Next, I created an object called "links", set equal to the "soup" object with the "find_all()" function appended, searching for all HTML anchor tags, which indicate a link out of the page. The 'a' in the following code matches the HTML anchor tag written as <a>. At this point, all links were extracted and inserted into the "links" object.
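A short sketch of this parsing step, assuming "r" holds the response fetched above (the title check is an illustrative addition):
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, "html.parser")  # build the parse tree from the fetched HTML
print(soup.title)  # quick sanity check that the document parsed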
Next, I coded a for-loop that iterates through the newly created "links" object and fetches each link's "href" attribute, utilizing the get() function.
links = soup.find_all("a")
for link in links:
    print(link.get("href"))  # read the href attribute of each anchor tag
The anchor tag <a> in HTML code signifies a link to another source. If an anchor tag is present in the HTML code, that is an indication that it links to another source outside of the webpage. I coded a for-loop that uses the "find_all()" function to search for all HTML anchor tags <a> with the 'href' attribute, which holds the URL of an outside website. "The 'href' tag is what determines that the anchor tag link is located to another HTML page and the details of the link to that page" (docs.python.org). These link details can be relative or absolute. According to w3.org, a Uniform Resource Locator (URL) must have three pieces to be absolute:
1. A scheme identifying the protocol used to access the resource.
2. The name of the machine hosting the resource.
3. The name of the resource itself, given as a path.
Furthermore, absolute links start with "http://"; HTTP is the protocol, followed by the machine information, such as "http://www.acme.gov". Relative links have only the third piece of information and start with "/", for example "/link/would_be_relative.html/" or "../link/is_relative.html#" (w3.org). The following code produced the web links.
links = soup.find_all("a")
for link in links:
    print(link.get("href"))  # read the href attribute of each anchor tag
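To illustrate the distinction, here is a small sketch (the relative path "/data/tables.html" and the acme.gov link are hypothetical examples) showing how urllib.parse breaks a URL into its three pieces and how urljoin() resolves a relative link against the base URL:
import urllib.parse

base = "https://www.census.gov/programs-surveys/popest.html"
parts = urllib.parse.urlparse(base)
print(parts.scheme)  # 'https' -> the protocol
print(parts.netloc)  # 'www.census.gov' -> the machine hosting the resource
print(parts.path)  # '/programs-surveys/popest.html' -> the resource path

# a relative link is resolved against the base URL:
print(urllib.parse.urljoin(base, "/data/tables.html"))
# -> https://www.census.gov/data/tables.html

# an absolute link passes through unchanged:
print(urllib.parse.urljoin(base, "https://www.acme.gov/page.html"))
# -> https://www.acme.gov/page.html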
First, I created a function "unique_links" that takes two arguments: the anchor tags wrangled by the previous for-loop and the "url" object. Within this function is the object "cleaned_links", created with the "set()" method to store the cleaned links this function produces. Then, once again, I inserted the previous for-loop to ensure that all anchor tags with an "href" attribute are wrangled. Next, two subsequent 'if' statements were used to clean up the links. The first 'if' statement ensures that while iterating through each link, if a link has no href value ('None'), the loop skips it and continues to the next link.
The following 'if' statement cleans trailing characters off the end of the link, removing common fragment identifiers such as "/" and "#" and leaving a clean URL ending (w3.org). This is done by using the "endswith()" function and the subsequent statement "link = link[:-1]", which slices off the last character of the string (developers.google.com). The next object created is "abs_url", which converts the now-tidied links from relative to absolute: it uses the "urljoin()" method to concatenate the protocol and hosting machine with the relative link, making every link absolute rather than relative. Each absolute link is then added to the previously created "cleaned_links" set by passing "abs_url" into the "add()" function. Lastly, the function returns the "cleaned_links" set. The last line of code, located outside of the function, "cleaned_links = unique_links(links, url)", sets the "cleaned_links" object to the result of calling the newly created "unique_links()" function with two arguments: the anchor tags and the "url" object. Here is the code that executes this:
def unique_links(tags, url):
    cleaned_links = set()  # this set variable ensures no duplicates
    for link in tags:
        link = link.get("href")
        if link is None:
            continue
        if link.endswith('/') or link.endswith('#'):
            link = link[:-1]  # this takes the fragment identifiers '/' and '#' off the ends of links
        abs_url = urllib.parse.urljoin(url, link)
        cleaned_links.add(abs_url)  # adding the product, the absolute URL links, to the set
    return cleaned_links

cleaned_links = unique_links(links, url)
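A quick way to inspect the result, assuming the code above has already run (the sample size of five is an arbitrary choice):
print(len(cleaned_links))  # number of unique absolute links found
for link in sorted(cleaned_links)[:5]:
    print(link)  # preview a few of the cleaned links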
Within the function "unique_links", the object "cleaned_links" was created with the "set()" method. According to GeeksforGeeks.org, "the set() method is used to convert any of the iterable elements in a sequence to iterable elements with distinct elements, commonly called a Set". All elements in a set are unique, and a set will not allow duplicate strings or items. The "set()" function therefore produces unique URLs. "Sets are unordered collections of distinct objects and do not allow for duplicates" (GeeksforGeeks.org).
def unique_links(tags, url):
    cleaned_links = set()  # this set variable ensures no duplicates
Further down the function is the code that adds the absolute URLs, "abs_url", to the "cleaned_links" set. This results in no redundant links being added to the "cleaned_links" object, because it was created with the "set()" method and duplicates are silently ignored.
        abs_url = urllib.parse.urljoin(url, link)
        cleaned_links.add(abs_url)
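A tiny sketch of this deduplication behavior, using a made-up URL:
deduped = set()
deduped.add("https://www.census.gov/data.html")
deduped.add("https://www.census.gov/data.html")  # the duplicate is silently ignored
print(deduped)  # {'https://www.census.gov/data.html'} -- only one copy is kept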
The following is the complete Python code written to extract all the unique web links from the HTML code of the "Current Estimates" page (in the web links section) that point out to other HTML pages.
import requests
import csv
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

url = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(url)  # fetching the webpage
soup = BeautifulSoup(r.content, "html.parser")  # creating a BeautifulSoup object to parse text

links = soup.find_all("a")
for link in links:
    print(link.get("href"))  # optional demonstration step: print each raw href

def unique_links(tags, url):  # defining a function that takes two arguments: anchor tags and the base url
    cleaned_links = set()  # this set variable ensures no duplicates
    for link in tags:
        link = link.get("href")
        if link is None:
            continue
        if link.endswith('/') or link.endswith('#'):
            link = link[:-1]  # strip the trailing '/' or '#' fragment identifier
        abs_url = urllib.parse.urljoin(url, link)
        cleaned_links.add(abs_url)
    return cleaned_links

cleaned_links = unique_links(links, url)

with open("Task_AAM1_C996.csv", "w") as csv_file:
    writer = csv.writer(csv_file, delimiter="\n")
    writer.writerow(cleaned_links)
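One note on the output step: because the delimiter is "\n", writerow() places every link on its own line inside a single CSV record. A more conventional alternative (my suggestion, not the submitted code) writes one link per row:
with open("Task_AAM1_C996.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    for link in sorted(cleaned_links):
        writer.writerow([link])  # one absolute URL per CSV row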
Additionally, the HTML source code from the indicated website has been saved as:
Census.gov_html_source_code.txt
The file has been saved to Task_AAM1_C996.csv
A screenshot of the results.
Works Cited
Python Strings. (n.d.). Google for Education. Retrieved from https://developers.google.com/edu/python/strings
5 HTML and URLs. (n.d.). Retrieved November 26, 2020, from https://www.w3.org/TR/WD-html40-970917/htmlweb.html
Beautiful Soup Documentation. (n.d.). Retrieved November 23, 2020, from https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Manjeet_04. (2020, April 24). Python: set() method. GeeksforGeeks. Retrieved November 23, 2020, from https://www.geeksforgeeks.org/python-set-method/