Using Alpha Five Web Development Tools to Crawl and Scrape Other Websites

We recently received a question from an Alpha developer asking how to take advantage of Alpha Five's web development tools to create crawling and scraping applications. Here's what Selwyn Rabins, Alpha Software's CTO, had to say:

Creating applications in Alpha Five v11 that do crawling and scraping is extremely easy. Alpha Five's programming language, Xbasic, has a number of built-in functions that you can use:
http_get_page2()
  • This function takes a URL as input and returns the contents of the page.
extract_all_strings() and extract_string()
  • These functions can be used to easily extract text from a page.
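As a rough illustration for readers coming from other languages, the delimiter-based extraction these functions perform can be sketched in Python. The helper names below mirror the Xbasic functions but are my own Python translations, not Alpha Five APIs, and are based only on the behavior described above:

```python
def extract_string(text, start, end):
    """Return the first substring found between the start and end markers, or ""."""
    i = text.find(start)
    if i == -1:
        return ""
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return ""
    return text[i:j]

def extract_all_strings(text, start, end):
    """Return every substring between the markers as a CRLF-delimited list,
    mirroring the CRLF list format the Xbasic function is described as returning."""
    results = []
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            break
        i += len(start)
        j = text.find(end, i)
        if j == -1:
            break
        results.append(text[i:j])
        pos = j + len(end)
    return "\r\n".join(results)
```

For example, `extract_string('<p class="std-address">CAMBRIDGE MA</p>', '<p class="std-address">', '</p>')` returns `'CAMBRIDGE MA'`.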

In addition, Alpha Five has a built-in XML parser for more advanced scraping, and the stringscanner object can also be used for advanced text extraction. The following is an example of how you can scrape the USPS web site to get the city and state name for a given ZIP code:

dim zip5 as c = "02139"
dim city as c = ""
dim state as c = ""

dim citystate_text as c = ""
dim string_end as c = ""
dim string_start as c = ""
dim usps_result as c = ""

zip5 = left(alltrim(zip5),5)
string_start = "<p class=\"std-address\">"
string_end = "</p>"

'this is all one line - shown as wrapped because of page width
'get the response from the USPS web site into an Xbasic string variable

usps_result = http_get_page2("https://tools.usps.com/go/ZipLookupResultsAction!input.action?resultMode=2&companyName=&address1=&address2=&city=&state=Select&urbanCode=&postalCode=" + zip5 + "&zip=")

'scrape the citystate text from the response using the extract_string() function
citystate_text = extract_string(usps_result,string_start,string_end)
state = word(citystate_text,-1)
city = alltrim(left(citystate_text,len(citystate_text)-len(state)))
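The split at the end of the snippet treats the last word of the scraped text as the state abbreviation and everything before it as the city name. The same logic can be mirrored in plain Python (the sample input is assumed to match what the USPS results page renders for a ZIP code):

```python
def city_state_from_text(citystate_text):
    """Split scraped text like 'CAMBRIDGE MA' into (city, state):
    the last whitespace-separated word is the state abbreviation,
    everything before it is the city name."""
    citystate_text = citystate_text.strip()
    city, _, state = citystate_text.rpartition(" ")
    return city.strip(), state

print(city_state_from_text("CAMBRIDGE MA"))  # ('CAMBRIDGE', 'MA')
```

Note that `rpartition` splits on the last space, so multi-word city names such as "NORTH ANDOVER MA" are handled correctly too.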

A Custom Function for Crawling and Scraping Using Alpha Five Web Development Tools

Let's build a new function called CrawlAndScrape(). This function takes the URL of a web page, downloads the contents of that page, and looks for any hyperlinks on the page that point to other pages on the same website.
Next, the function follows each of those links (a process sometimes called "crawling") and grabs the title of each page, adding it to a list (a process sometimes called "scraping"). Once all of the titles from the linked pages have been collected, the function returns them.

The Script in Action

Here is what happens if we pass the CrawlAndScrape function the URL of Alpha Software's website.
? CrawlAndScrape("http://www.alphasoftware.com")
=
About Us | Alpha Software
Platform | Alpha Software
Services | Alpha Software
Support | Alpha Software
Contact | Alpha Software
Shop | Alpha Subscriptions
Platform | Alpha Software
About Us | Alpha Software
Shop | Alpha Subscriptions
Support | Alpha Software
Services | Alpha Software
Video Library | Alpha Software

The Script Itself

Want to see how the script was written? It's all below with comments explaining each section.
'Date Created: 29-Jan-2013 02:28:51 PM
'Last Updated: 29-Jan-2013 02:28:51 PM
'Created By : dave mccormick

FUNCTION CrawlAndScrape AS C ( url as C )

'STEP 1: Download the contents of the website specified in the incoming URL variable
'The http_get_page2() function is built into Alpha Five. Just pass it a URL
'and it will pass you back all of the HTML code for that page.

dim x as c
x = http_get_page2(url)

'STEP 2: Extract all of the URLs that are contained in the page's <A> tags
'specifically we're looking for the part contained in the href parameter of the
'tag so we'll use HREF=" as the beginning of where we want to extract data
'The extract_all_strings() function is also built into Alpha Five. Just pass it
'text such as XML or HTML and tell it how to identify the beginning and ending of the
'data you are interested in. The function returns a list of values in a CRLF
'list format. In this case we told it to look for data that begins with href="
'and to stop looking when it finds a closing " mark. The \ is an escape
'character that tells Alpha Five not to take the next character literally, but rather
'consider it part of the string.

dim y as c
y = extract_all_strings(x,"href=\"","\"")

'Loop through the list of URLs and identify the ones that
'start with a "/" and put them in a variable called pagelist.
'Those are the internal links. Here we are using the FOR EACH statement,
'which allows us to loop through the list and examine each line one by one.

dim pagelist as C
dim href as C

for each href in y
    if substr(href,1,1) = "/"
        pagelist = pagelist + href + crlf()
    end if
next

'Loop through each of the pages in the pagelist variable.
'As you loop through, download the contents of each page and
'add just the title of the page to a variable called titles.
'Again we are using another FOR EACH, and we also are using
'the extract_string() function. extract_string() works just like
'the extract_all_strings() function we used earlier, except it returns
'just one string. We are using it in this case because we want to extract the
'page title, and there is only one page title.

'Notice that we used "<title>" and "</title>" to tell the function where in the
'string it can find the title (i.e., look between the title tags).

dim link as c
dim titles as c

for each link in pagelist
    x = http_get_page2(url + link)
    titles = titles + extract_string(x,"<title>","</title>") + crlf()
next

'The final line sets the return value of the function to titles, which is the
'variable that holds all of the page titles.

CrawlAndScrape = titles

END FUNCTION
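For comparison, here is a rough Python sketch of the same crawl-and-scrape flow. To keep it testable without a live network connection, the page-fetching step is passed in as a function (in real use you might wrap `urllib.request.urlopen` for this role); the function name and structure are my own translation, not part of Alpha Five:

```python
import re

def crawl_and_scrape(url, fetch):
    """Download a page, follow every internal link (an href starting with "/"),
    and collect the <title> of each linked page, one title per line."""
    html = fetch(url)
    # Same idea as extract_all_strings(x, 'href="', '"') in the Xbasic version
    hrefs = re.findall(r'href="([^"]*)"', html)
    titles = []
    for href in hrefs:
        if href.startswith("/"):              # internal links only
            page = fetch(url + href)
            m = re.search(r"<title>(.*?)</title>", page, re.S)
            titles.append(m.group(1) if m else "")
    return "\r\n".join(titles)

# Example with an in-memory "site" standing in for real HTTP requests:
pages = {
    "http://example.com": '<a href="/a">A</a> <a href="http://other.com/">x</a>',
    "http://example.com/a": "<html><title>Page A</title></html>",
}
print(crawl_and_scrape("http://example.com", pages.__getitem__))
```

Injecting the fetcher this way keeps the crawling logic separate from the transport, which also makes it easy to add throttling or caching later.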

Summary

We could have added another loop that finds all of the links in the linked pages and then crawls down them as well. Or we could have checked each link for a 404 or other server error and reported back the results. By taking advantage of the Alpha Five web development tools, and using high-level functions like http_get_page2() and extract_all_strings() and the FOR EACH loop, you can build useful web crawling and page scraping scripts very quickly.
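The 404-checking idea mentioned above could be sketched like this. Again the status lookup is injected so the logic stays clear (and testable) without a live site; in practice `status_of` would wrap a real HTTP HEAD or GET request:

```python
def find_broken_links(links, status_of):
    """Return the subset of links whose HTTP status code indicates an error
    (404 Not Found, or any other 4xx/5xx response)."""
    return [link for link in links if status_of(link) >= 400]

# Example with a fake status table standing in for real requests:
statuses = {"/about": 200, "/old-page": 404, "/api": 500}
print(find_broken_links(["/about", "/old-page", "/api"], statuses.get))
```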

About Author

Chris Conroy

Chris Conroy runs digital programs for Alpha Software.
