If I had a dime for each 404 Not Found, This page has moved to http://www.new-and-better.com, or simply, This URL does not have a DNS entry message, I'd be sipping rum on a beach in St. Thomas instead of writing this book.
One of the problems people have with the Web is the so-called invalid link. This crops up in search engines where the links are out of date, in user home pages that list a thousand "cool" sites (and the maintainer never checks them more than once), and in other sources of external links.
When you add a link to a page maintained by someone else, that person has no idea that you've added a link. Therefore, when the other site moves, its author/administrator can't be expected to notify everyone with such links to inform them of the page's new location.
And because it's not uncommon for a site to change servers, the invalid link message is very prevalent, and for the foreseeable future, it's here to stay.
This chapter presents a Windows client tool that verifies links of a specified URL. Given a URL, this program verifies that all the hyperlinked URLs actually exist on the Web.
So, first let's look at the different types of invalid links that exist.
So far, I've been using "invalid links" as a generic term for all links that don't take us where we want to go, simply because I didn't want to repeat this list each time I typed it. The following are the types of invalid links and what can cause them:
Figure 15.1. Unable to Locate Server error in Netscape.
Figure 15.2. File Not Found error.
Figure 15.3. This link has moved.
With these conditions in mind, you can write a simple Visual Basic application that takes a root URL and checks the links on the page. You can also make sure that the referenced images are present.
The usual flow of the application would look like this: The user enters a URL to validate. The app loads that URL into a Sax Webster control (first described in Chapter 4, "Using Web Browser Custom Controls") and parses all of its anchors using the control's GetLinkCount and GetLinkURL methods as described in Chapter 14, "WebSearcher: A Simple Search Tool." Then the Microsoft Internet Control Pack's HTTP client control is used to retrieve the HTTP header information for the anchors (see Chapter 2, "HTTP: How To Speak on the Web," for information on HTTP messages, and Chapter 5, "Retrieving Information From the Web," for information on the HTTP client control).
If you don't have access to the Sax Webster control, there's a short section at the end of this chapter describing how to accomplish the link checking using the Microsoft Internet Control Pack's HTML control (also introduced in Chapter 4). Using the Microsoft control requires a good deal more code, so this chapter describes only how to modify the Webster-based code to work with the Microsoft control.
Each response received by the HTTP client control is then checked for the previously listed error conditions, and if present, the URL should be marked as not valid. If a valid HTTP header information response is received for the link, the link is marked as valid. Also, if the Web server associated with the link provides the Last-Modified HTTP response header field, the field's value is displayed in the grid.
The Last-Modified HTTP response message header field specifies the date and time the resource represented by the requested URL was last updated. However, not all HTTP servers provide this field when returning information to HTTP client applications.
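To make the header concrete, here is a minimal sketch in plain Visual Basic that pulls one field out of the raw text of a response. The HeaderValue function name and the idea of working on raw response text are illustrative assumptions; the Link Verifier itself reads headers through the HTTP client control. The lookup ignores case because HTTP header field names are case-insensitive.

```vb
' Hypothetical helper (not part of the Link Verifier project).
' Returns the value of the header field sName from sResponse, which is
' assumed to be raw HTTP response text with CR/LF-terminated lines.
' Returns "" if the field is not present.
Function HeaderValue(ByVal sResponse As String, ByVal sName As String) As String
    Dim sCRLF As String, sLine As String
    Dim nStart As Long, nEnd As Long, nColon As Long

    sCRLF = Chr$(13) & Chr$(10)
    nStart = 1
    Do While nStart <= Len(sResponse)
        ' cut out the next line
        nEnd = InStr(nStart, sResponse, sCRLF)
        If nEnd = 0 Then nEnd = Len(sResponse) + 1
        sLine = Mid$(sResponse, nStart, nEnd - nStart)

        ' a header line looks like "Name: value"; the status line has no colon
        nColon = InStr(sLine, ":")
        If nColon > 0 Then
            If UCase$(Left$(sLine, nColon - 1)) = UCase$(sName) Then
                HeaderValue = Trim$(Mid$(sLine, nColon + 1))
                Exit Function
            End If
        End If
        nStart = nEnd + 2
    Loop
End Function
```

For a response that contains the line Last-Modified: Mon, 01 Jan 1996 12:00:00 GMT, calling HeaderValue(sResponse, "last-modified") returns Mon, 01 Jan 1996 12:00:00 GMT. If the server omits the field, the function returns an empty string.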
You'll also want the user to be able to specify whether to check only links local to that site or all links referenced within the page. You add this functionality via a frame and two option buttons.
The form has a tab control (see Figures 15.4 and 15.5) that allows the user to select either Link View or Web View: the Link View is where the grid is placed, the Web View is where the Webster control is placed. There is also a checkbox on the Web View to enable and disable loading of embedded images by the Webster control. By turning the load images off, the page to be verified will load faster.
The final result of this section is shown in Figure 15.4, which shows the Link View tab, and Figure 15.5, which shows the Web View tab. Most of the controls use the default properties, but a few have their properties modified to meet the needs of this application.
Figure 15.4. Design time view of the Link View tab.
Figure 15.5. Design time view of the Web View tab.
Start a new project in Visual Basic. View the currently available custom controls by selecting the Tools | Custom Controls menu item. This project requires the following controls be included in the list:
All controls except the Webster control and the HTTP client control ship with the Visual Basic Professional Edition. After you add the controls, you must also add a reference to the Microsoft Internet Support Objects. Use the Tools | References menu and select this item in the list. If it is not in the list box, click the Browse button and locate the file NMOCOD.DLL. If you can't find this file on your system, you probably need to re-install the Microsoft Internet Control Pack.
Once the proper controls have been added to the project, you can begin to populate the form. To create the controls directly on the form, follow these steps:
Figure 15.6. The SSTab control's custom properties page.
Now that the form's shell has been created, it's time to add controls to the tabs. Bring the tab back to the Link View by clicking on its tab caption. Refer to Figure 15.4 for control placement. Follow these steps to add the controls for this tab:
You're almost there. Click the Web View tab caption to move to the other tab. To add the controls, refer to Figure 15.5 and follow these steps:
Now that the controls are in place, it's time to start entering some code. The next section covers all the code necessary to make the Link Verifier work.
The task of this application is to retrieve the user-specified URL, gather all of the anchors out of that page, add those anchors to the grid, and then for each URL in the grid, attempt to retrieve the HTTP header information. Finally, the app marks the URL verified or not verified accordingly.
A lot of the code here is also used in Chapter 14, "WebSearcher: A Simple Search Tool." The link checker, as you will see, is a customized version of the Web search tool. There, instead of looking for the invalid link conditions, you look for a user-specified keyword.
The Declarations section contains the following code:
    Option Explicit

    Dim Conn_Done As Integer
    Dim Grid_Pos As Integer
The Conn_Done variable is a flag used by the HTTP control to signal the end of an HTTP request. The Grid_Pos variable stores the row in the grid where the next URL checked will be inserted.
The AddAnchor subroutine is used to add URLs to the grid control. The routine goes through the grid row by row making sure that the URL to be added doesn't already exist. If the URL doesn't exist in the grid the routine adds it to the grid. The code is shown in Listing 15.1.
Listing 15.1. AddAnchor subroutine.
    Sub AddAnchor(sNewAnchor As String)
        Dim X As Integer

        For X = 1 To Grid_Pos
            Grid1.Row = X
            Grid1.Col = 0
            If Grid1.Text = sNewAnchor Then
                Exit Sub
            End If
        Next X

        Grid1.AddItem sNewAnchor
        Grid_Pos = Grid_Pos + 1
    End Sub
The GetHostFromURL() function (Listing 15.2) was first introduced in Chapter 5. It is taken from the dsWeb sample that ships with the Dolphin Systems dsSocket control discussed in that chapter.
The function parses the host name from a URL. The host name is the portion of the URL that occurs between the "//" and the first "/" character that follows it; if no "/" follows, everything after the "//" is returned. If the "//" is not present, GetHostFromURL() considers the URL to be invalid and returns an empty string.
Listing 15.2. GetHostFromURL() Function
    Private Function GetHostFromURL(szURL As String) As String
        ' parse out the hostname from a valid URL
        ' the URL should be of the format: http://www.microsoft.com/index.html
        ' the returned hostname would then be: www.microsoft.com
        Dim szHost As String
        Dim lPos%

        szHost = szURL

        ' invalid URL
        If InStr(szHost, "//") = 0 Then
            GetHostFromURL = ""
            Exit Function
        End If

        szHost = Mid(szHost, InStr(szHost, "//") + 2)
        lPos% = InStr(szHost, "/")
        If lPos% = 0 Then
            GetHostFromURL = szHost
            Exit Function
        Else
            GetHostFromURL = Left(szHost, lPos% - 1)
            Exit Function
        End If
    End Function
The Form_Load event is where you take care of all the startup activity for the application. The code is provided in Listing 15.3.
The grid format is set up, including column widths and captions. Then the local only option button is selected as the default. Finally, the Reset command button's Click event is fired.
Listing 15.3. Form_Load event code.
    Private Sub Form_Load()
        'Set up grid headers
        Grid1.Row = 0
        Grid1.Col = 0
        Grid1.Text = "URL"
        Grid1.ColWidth(0) = 5000
        Grid1.Col = 1
        Grid1.Text = "Verified?"
        Grid1.ColWidth(1) = 900
        Grid1.Col = 2
        Grid1.Text = "Updated"
        Grid1.ColWidth(2) = 1200

        optLocal(0).Value = True
        Call cmdMain_Click(0)
    End Sub
There are two routines that provide some simple user interface functionality. The first, GridClear, clears the grid in preparation for a new verification. The second, ToggleControls, enables and disables some of the controls on the form based on the pState% flag that is passed as a parameter. The code for both routines is in Listing 15.4.
Listing 15.4. GridClear and ToggleControls Subroutines.
    Public Sub GridClear()
        Dim X

        For X = 1 To Grid_Pos - 1
            Grid1.RemoveItem 1
        Next

        Grid1.Row = 1
        For X = 0 To 2
            Grid1.Col = X
            Grid1.Text = ""
        Next

        Grid_Pos = 1
    End Sub

    Public Sub ToggleControls(pState%)
        Frame1.Enabled = pState%
        cmdVerify.Enabled = pState%
        Me.MousePointer = IIf(pState%, vbDefault, vbHourglass)
        If pState% = False Then SSTab1.Tab = 0
        SSTab1.Enabled = pState%
    End Sub
The code behind the text box and the load images checkbox is equally straightforward. The code for these two controls is provided in Listing 15.5.
When the text entered into the txtURL text box changes, the application sets the Enabled property of the cmdVerify command button to True if there are any characters in the text box. It is set to False otherwise.
The chkImages checkbox merely changes the value of the Webster control's LoadImages property based on whether or not the check box is checked.
Listing 15.5. Code for txtURL and chkImages.
    Private Sub txtUrl_Change()
        cmdVerify.Enabled = (Len(Trim$(txtUrl)) > 0)
    End Sub

    Private Sub chkImages_Click()
        Webster1.LoadImages = chkImages.Value
    End Sub
For ease of explanation, I put the Exit and Reset buttons in one control array, and left the Verify button on its own. The code for the Click event of the Reset/Exit button control array is found in Listing 15.6.
There's nothing too special about this code: all you want to do is allow the user to clear the results by pressing the Reset button. This clears the form-level variables, as well as the URL text box and the results grid. It also cancels the Webster control's page load, if one is in progress.
The Exit button simply unloads the form, causing the application to end.
Listing 15.6. The cmdMain_Click event code.
    Private Sub cmdMain_Click(Index As Integer)
        Select Case Index
            Case 0 'Reset
                txtUrl.Text = ""
                Webster1.Cancel
                GridClear
                StatusBar1.SimpleText = "Ready..."
            Case 1 'Exit
                Unload Me
        End Select
    End Sub
When the user enters a URL into txtURL and clicks on the Verify button, it's time for the real action to begin. The cmdVerify_Click event is where the action gets kicked off, as you'll see.
The code for the event is given in Listing 15.7. The first few lines of code clear the grid and disable some of the buttons and the tab control. Then the host name of the machine on which the URL entered resides is extracted from the URL by calling GetHostFromURL(). Next, the status bar caption is updated to reflect the page being loaded.
The Webster control's LoadPage method is used to load the URL to be verified. The Visual Basic Choose() function is used as the switch for a DoEvents loop. This function was discussed in detail in Chapter 14, but basically the value returned is the value from the list provided that corresponds to the integer value of the first parameter. In this case, the value returned will be based on the current value of the Webster control's LoadStatus property each time through the loop. The loop continues until either the URL is completely loaded or an error occurs.
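The following sketch isolates the Choose() test from the wait loop so you can see what it returns for each status value. The PageStillLoading function name is hypothetical, and the only thing assumed about the LoadStatus values is what the listing's argument list implies: values 1 through 4 keep the loop running, and values 0, 5, and 6 end it. Note that Choose() returns Null when the index falls outside the list, so the sketch checks for that case explicitly.

```vb
' Hypothetical helper that mirrors the loop condition in cmdVerify_Click.
' nLoadStatus stands in for Webster1.LoadStatus.
Function PageStillLoading(ByVal nLoadStatus As Integer) As Boolean
    Dim vFlag As Variant

    ' Choose is 1-based, so add 1 to map status 0 to the first choice
    vFlag = Choose(nLoadStatus + 1, 0, 1, 1, 1, 1, 0, 0)

    If IsNull(vFlag) Then
        ' index outside the list: treat as "not loading"
        PageStillLoading = False
    Else
        PageStillLoading = (vFlag <> 0)
    End If
End Function
```

With this helper, PageStillLoading(2) is True, while PageStillLoading(0), PageStillLoading(5), and PageStillLoading(9) are all False.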
After the loop finishes, the URL is entered into the grid. If an error occurred while loading the URL (LoadStatus >= 5), the Verified? column in the grid is set to No and the routine exits.
If the URL was loaded successfully, the Verified? column in the grid is set to Yes and the routine proceeds to extract all the links from the page. This is accomplished using the Webster control's GetLinkCount and GetLinkURL methods. These methods allow you to iterate through a list of all the links found on the loaded Web page. The code checks to make sure a link is an HTTP link (as opposed to a mailto: or news: link, for example, which aren't accessed using the HTTP protocol and therefore can't be verified by this application). If it is an HTTP link and the user has selected the Local Links only option button (optLocal(0).Value = True), the code further checks to make sure the link is to a URL on the same host as the URL being verified. If it is, the link is added to the grid using AddAnchor. If optLocal(0).Value = False, then the URL is automatically added to the grid.
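The local-link test in the listing uses InStr(), which matches the host name anywhere in the link, including inside a query string. If that matters for your pages, a stricter comparison is possible. The sketch below is one way to do it, not the code used in this chapter; the IsLocalLink function name is hypothetical, and it assumes links of the form http://host/path.

```vb
' Hypothetical stricter replacement for the InStr() test.
' Returns True only if sURL is an HTTP URL whose host part equals sHost.
Function IsLocalLink(ByVal sURL As String, ByVal sHost As String) As Boolean
    Dim sRest As String
    Dim nPos As Integer

    IsLocalLink = False

    ' must start with http:// (compared without regard to case)
    If UCase$(Left$(sURL, 7)) <> "HTTP://" Then Exit Function

    ' keep only the text between "http://" and the next "/"
    sRest = Mid$(sURL, 8)
    nPos = InStr(sRest, "/")
    If nPos > 0 Then sRest = Left$(sRest, nPos - 1)

    IsLocalLink = (UCase$(sRest) = UCase$(sHost))
End Function
```

For example, IsLocalLink("http://www.example.com/a.html", "www.example.com") is True, but IsLocalLink("http://other.org/?ref=http://www.example.com", "www.example.com") is False, whereas the InStr() test would accept both.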
After all the links on the page being verified are added to the grid, the HTTP client control is used to retrieve the header information for each of the links in the grid. The original URL is also checked again, but this time to retrieve the HTTP header fields for the URL since the Webster control doesn't provide properties for most of them (it does provide properties for the Content-Type and Content-Size headers).
The code loops through each item in the grid, using the variable Grid_Pos as the count of the number of rows in the grid. The Conn_Done variable is used as a flag to indicate that the current header information request has completed. In cmdVerify_Click the flag is set to 0. The flag is set to 1 within the HTTP1_DocOutput event discussed in the next section. The URL is extracted from the grid, the status bar is updated, the URL is assigned to the HTTP control's URL property, and finally the HTTP control's GetDoc method is invoked to retrieve the header information (recall that the HTTP control's Method property is set to 2 (HEAD method) at design time). Another DoEvents loop waits until Conn_Done is set before continuing to the next URL in the grid.
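The DoEvents loop described above waits until Conn_Done is set, however long that takes. If you would rather give up on a slow server after a while, the wait can be wrapped in a function with a timeout. The following is only a sketch: the WaitForConnDone name and the timeout parameter are assumptions, and the Conn_Done declaration is repeated here just so the fragment stands on its own.

```vb
Dim Conn_Done As Integer   ' same flag as in the Declarations section

' Hypothetical helper: loops until Conn_Done is nonzero or the timeout expires.
' Returns True if the flag was set, False if the wait timed out.
Function WaitForConnDone(ByVal sngTimeoutSecs As Single) As Boolean
    Dim sngStart As Single
    Dim sngElapsed As Single

    sngStart = Timer    ' seconds since midnight
    Do While Conn_Done = 0
        DoEvents
        sngElapsed = Timer - sngStart
        ' Timer wraps to 0 at midnight; correct for that
        If sngElapsed < 0 Then sngElapsed = sngElapsed + 86400
        If sngElapsed > sngTimeoutSecs Then
            WaitForConnDone = False
            Exit Function
        End If
    Loop
    WaitForConnDone = True
End Function
```

In cmdVerify_Click, the inner While loop could then become a single call such as If Not WaitForConnDone(30) Then followed by code that marks the current row as not verified.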
After all the URLs have been processed, the command buttons and tab are enabled once again, allowing the user to enter another URL to verify or to use the Webster control to view the URL entered in txtURL.
Listing 15.7. The cmdVerify_Click event code.
    Private Sub cmdVerify_Click()
        Dim lHostName$, i%, URL$, X

        GridClear
        ToggleControls False
        lHostName$ = GetHostFromURL(txtUrl.Text)
        StatusBar1.SimpleText = "Loading " & txtUrl.Text
        Webster1.LoadPage txtUrl.Text, False

        'wait till the page is loaded
        While Choose(Webster1.LoadStatus + 1, 0, 1, 1, 1, 1, 0, 0)
            DoEvents
        Wend

        Grid1.Row = 1
        Grid1.Col = 0
        Grid1.Text = txtUrl.Text

        'if an error occurred loading the page,
        ' add it to the grid and exit
        If (Webster1.LoadStatus >= 5) Then
            Grid1.Col = 1
            Grid1.Text = "No"
            ToggleControls True
            Exit Sub
        End If

        'add this link to the grid as verified:
        Grid1.Col = 1
        Grid1.Text = "Yes"

        'now get all of the links on the page:
        For i% = 0 To Webster1.GetLinkCount("") - 1
            URL$ = Webster1.GetLinkURL("", i%)
            'is it an HTTP link?
            If UCase$(Left$(URL$, 4)) = "HTTP" Then
                'are we verifying only local links?
                If optLocal(0).Value = True Then
                    If InStr(UCase$(URL$), "HTTP://" & UCase$(lHostName$)) Then
                        AddAnchor URL$
                    End If
                Else
                    AddAnchor URL$
                End If
            End If
        Next

        For X = 1 To Grid_Pos
            Conn_Done = 0
            Grid1.Row = X
            Grid1.Col = 0
            StatusBar1.SimpleText = "Loading " & Grid1.Text
            HTTP1.URL = Grid1.Text
            HTTP1.GetDoc
            While Conn_Done = 0
                DoEvents
            Wend
        Next X

        ToggleControls True
    End Sub
The responses to the HEAD request messages generated in cmdVerify (described in the previous section) are handled by the HTTP control's DocOutput and Error events. The code for these events is given in Listing 15.8.
The DocOutput event (described in detail in Chapter 5, "Retrieving Information From the Web"), is fired whenever the HTTP control receives data from the HTTP server it's connected to. This data can be in the form of HTTP header fields such as Content-Type or Server (these are discussed in Chapter 2, "HTTP: How To Speak On The Web") or content data (such as the HTML markup code or an image file). The event is also fired at the start and end of a received message and in the event of an error. The event provides a parameter named DocOutput which is an object containing all the information about the received message. The object's State property indicates the reason that the DocOutput event was fired and is used in a Select Case construct to determine what course of action to take.
The available states are:
icDocHeaders | HTTP header fields have been received
icDocBegin | Retrieval started
icDocEnd | Retrieval ended
icDocData | Content data is being received
icDocError | An error has occurred
Because you're not interested in knowing when the retrieval starts or what the content data looks like (there shouldn't be any content data returned because the request message used the HEAD method), these two states have no code associated with them in Listing 15.8.
The icDocHeaders state is entered whenever all the HTTP response message header fields have been received. The DocOutput object provides a collection aptly named Headers, which contains all the header fields received. The Headers collection has a Count property and an Items collection. There is one entry in the Items collection for each header field received. Each item has a Name and a Value property. You're going to be displaying only the Last-Modified header, so the code loops through all the available header fields (there won't be more than a few). If the Last-Modified header is found, its value is placed in the Updated column of the grid.
The icDocEnd state is entered when the connection with the Web server terminates. If the Conn_Done flag was not previously set by an error condition, the code marks the current URL as verified and sets the Conn_Done flag to signal the end of the verification process for the current URL. Note that even if an error such as URL not located occurs, the icDocEnd state is still entered.
The icDocError state is entered whenever an HTTP server returns an error code. The application places the HTTP control's ReplyCode property (the error code received from the HTTP server) into the status bar, marks the current URL as not verified, and marks the end of the verification for this URL by setting Conn_Done to 1.
The Error event is fired whenever an error occurs that causes the HTTP request/response messages to be invalid. This event is handled the same way as the icDocError state discussed in the previous paragraph.
Listing 15.8. The HTTP control's event code.
    Private Sub HTTP1_DocOutput(ByVal DocOutput As DocOutput)
        Dim i%

        Select Case DocOutput.State
            Case icDocHeaders
                With DocOutput.Headers
                    For i% = 1 To .Count
                        If .Item(i%).Name = "Last-Modified" Then
                            Grid1.Col = 2
                            Grid1.Text = .Item(i%).Value
                        End If
                    Next
                End With
            Case icDocBegin
            Case icDocEnd
                'if the done flag is already set, exit:
                If Conn_Done Then Exit Sub
                StatusBar1.SimpleText = "Done... "
                Grid1.Col = 1
                Grid1.Text = "Yes"
                Conn_Done = 1
            Case icDocData
            Case icDocError
                'if the URL doesn't exist, we'll get an error...
                StatusBar1.SimpleText = "Reply Code: " & HTTP1.ReplyCode
                Grid1.Col = 1
                Grid1.Text = "No"
                Conn_Done = 1
        End Select
    End Sub

    Private Sub HTTP1_Error(Number As Integer, Description As String, Scode As Long, Source As String, HelpFile As String, HelpContext As Long, CancelDisplay As Boolean)
        Conn_Done = 1
        Grid1.Col = 1
        Grid1.Text = "No"
    End Sub
Now that all the code is entered, it's time to test the application. Either connect to the Internet or start a local Web server, then run the application. Enter a URL in the txtURL text box and click the Verify button. You should see the status bar indicate the page being loaded, and then the grid fills with all the local links on the page you specified. Finally, all of those links are checked and the status bar is updated as each link is checked.
Figure 15.7 shows the application after it was run against a local Web server using the default Web page (note that no file name is specified in the URL text box, only the server name). I selected All Links in the Links To Verify frame in order to show the two external links as not verified (the machine was not connected to the Internet at the time the verification was performed).
Figure 15.7. Verifying a local server.
Figure 15.8 shows the application after it was run against a server that provides the Last-Modified HTTP header field. I re-sized the columns at runtime in order to display all three columns onscreen.
Figure 15.8. Verifying on a server that provides Last-Modified.
If you don't wish to use the Sax Webster control, or if you're looking for a programming challenge to wind up this book, rewrite the application using the Microsoft Internet Control Pack's HTML client control.
If you have the Webster control and are modifying the project created above, you must remove the Webster control from the new project. For some reason, if both controls are in the project, the Microsoft HTML control is unable to connect to an HTTP server.
The Microsoft control lacks the GetLinkCount() and GetLinkURL() methods provided by the Webster control but makes up for this by providing an event named DoNewElement. If the control's ElemNotification property is set to True, this event is fired for each new HTML element parsed as the page to be verified is loaded. You can check the event's ElemType parameter to determine if the element is a link anchor (in which case ElemType will be A) and if it is, use a modified AddAnchor procedure to add the link to the grid control. AddAnchor must be modified because the HTML control does not resolve relative URLs to the absolute URLs that are necessary for the HTTP control. I'll leave these modifications to you as a code challenge.
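If you want a starting point for the challenge, the sketch below shows one simple way to turn a relative link into an absolute URL before it is added to the grid. It is an assumption-laden illustration rather than a complete resolver: the ResolveURL function name is hypothetical, the base URL is assumed to look like http://host/path/file, any link containing a colon is assumed to carry its own scheme, and "../" segments, fragments, and query strings get no special handling. The last "/" is found with a loop so that the code relies only on basic string functions.

```vb
' Hypothetical helper: resolves sLink against the page URL sBase.
Function ResolveURL(ByVal sBase As String, ByVal sLink As String) As String
    Dim nHostStart As Integer, nPathStart As Integer
    Dim nLastSlash As Integer, i As Integer
    Dim sRoot As String, sDir As String

    ' a link with a scheme (http:, mailto:, news:) is already absolute;
    ' a base without "//" can't be used for resolution
    If InStr(sLink, ":") > 0 Or InStr(sBase, "//") = 0 Then
        ResolveURL = sLink
        Exit Function
    End If

    ' split the base into "http://host" and the directory part of its path
    nHostStart = InStr(sBase, "//") + 2
    nPathStart = InStr(nHostStart, sBase, "/")
    If nPathStart = 0 Then
        sRoot = sBase            ' base is just http://host
        sDir = "/"
    Else
        sRoot = Left$(sBase, nPathStart - 1)
        nLastSlash = nPathStart
        For i = nPathStart To Len(sBase)
            If Mid$(sBase, i, 1) = "/" Then nLastSlash = i
        Next i
        sDir = Mid$(sBase, nPathStart, nLastSlash - nPathStart + 1)
    End If

    If Left$(sLink, 1) = "/" Then
        ResolveURL = sRoot & sLink          ' relative to the server root
    Else
        ResolveURL = sRoot & sDir & sLink   ' relative to the page's directory
    End If
End Function
```

With a base of http://www.example.com/docs/index.html, the link page2.html resolves to http://www.example.com/docs/page2.html and the link /images/logo.gif resolves to http://www.example.com/images/logo.gif. A modified AddAnchor could call such a function with the URL of the page being verified before checking the grid for duplicates.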
Sample code for the DoNewElement event is provided in Listing 15.9. Chapter 4 and the Internet Control Pack's help file describe this event and its parameters in more detail.
Listing 15.9. Sample DoNewElement event code.
    Private Sub HTML1_DoNewElement(ByVal ElemType As String, _
            ByVal EndTag As Boolean, ByVal Attrs As HTMLAttrs, _
            ByVal Text As String, EnableDefault As Boolean)
        Dim i%

        'is this a link anchor?
        If UCase$(ElemType) = "A" Then
            'yes, find the HREF element:
            For i% = 1 To Attrs.Count
                If UCase$(Attrs.Item(i%).Name) = "HREF" Then
                    AddAnchor Attrs.Item(i%).Value
                End If
            Next
        End If
    End Sub
The UCase$() functions are used in the code above because the HTML control does not modify the case of the HTML tags as they are read from the HTML file. If an element's tag was placed in the file as lower case, the ElemType parameter is lower case as well.
You will also have to modify the code for cmdVerify_Click to use the HTML control to load the initial page. You should use the Conn_Done flag to signal the end of the page load by placing Conn_Done = 1 in the HTML control's EndRetrieval event and Conn_Done = -1 in the control's Error and Timeout events. Sample code for cmdVerify_Click is provided in Listing 15.10.
Listing 15.10. Sample cmdVerify_Click event code.
    Private Sub cmdVerify_Click()
        Dim lHostName$, i%, URL$, X

        GridClear
        ToggleControls False
        lHostName$ = GetHostFromURL(txtUrl.Text)
        StatusBar1.SimpleText = "Loading " & txtUrl.Text
        Conn_Done = 0
        HTML1.ElemNotification = True
        HTML1.RequestDoc txtURL.Text

        'wait till the page is loaded
        While Conn_Done = 0
            DoEvents
        Wend

        Grid1.Row = 1
        Grid1.Col = 0
        Grid1.Text = txtUrl.Text

        'if an error occurred loading the page,
        ' add it to the grid and exit
        If (Conn_Done = -1) Then
            Grid1.Col = 1
            Grid1.Text = "No"
            ToggleControls True
            Exit Sub
        End If

        'add this link to the grid as verified:
        Grid1.Col = 1
        Grid1.Text = "Yes"

        'now check all the links in the grid
        For X = 1 To Grid_Pos
            Conn_Done = 0
            Grid1.Row = X
            Grid1.Col = 0
            StatusBar1.SimpleText = "Loading " & Grid1.Text
            HTTP1.URL = Grid1.Text
            HTTP1.GetDoc
            While Conn_Done = 0
                DoEvents
            Wend
        Next X

        ToggleControls True
    End Sub
The last major change you'll have to make is to correct the code for the chkImages_Click event. The HTML control uses a property named DeferRetrieval to indicate whether embedded documents should be loaded by the control. Change the line of code for this event to read
HTML1.DeferRetrieval = (chkImages.Value = 0)
You'll also have to modify other code that references Webster1 to reference HTML1 (or whatever name you give the HTML control). Note that the HTML control does support the Cancel method, so in the cmdMain_Click event code you simply replace Webster1.Cancel with HTML1.Cancel.
The code for the HTTP control's events can be left intact as long as the link URL resolution is handled in AddAnchor as described above.
Being the last chapter in the book, this chapter was designed to incorporate information from several of the previous chapters. If you haven't done so already, hopefully this chapter prompted you to read some of the earlier chapters. Probably the most important chapter to help you grasp this chapter is Chapter 2, "HTTP: How To Speak On The Web," which discusses the HTTP protocol and HTTP header fields in detail.
The book concludes with several appendixes that discuss how to create HTML files (Appendix A, "HTML Reference"), Microsoft's new VB Script programming language (Appendix B, "Visual Basic Script Reference"), and programming Win/CGI applications for the Microsoft Internet Information Server (Appendix C, "Win/CGI on the Microsoft Internet Information Server"). The final appendix, Appendix D, "Bibliography and Cool Web Sites," provides a good listing of resources both on and off the 'Net to assist you in your Web programming endeavors.