In previous posts, we developed a small web application that runs in the IBM Cloud, then turned it into an application that analyzes a given text and returns a tone score per sentence. As the app’s output looked rather crude, we also improved its look and feel.
Now let’s consider the situation where someone wants to evaluate the content of a web page. We will enhance the application so that it evaluates the content of a web page at a given URL instead of analyzing copy-pasted text. After all the hard work we’ve done, it would be great to reuse as much of the application as possible.
Let’s get started!
1. Determine the requirements
The first requirement is to find a library that helps us extract the content from a URL. Fortunately, a quick search on the internet turns up the HTML::Extract module from CPAN. The description says “HTML::Extract – Perl extension for getting text and HTML snippets out of HTML pages in general.” That is exactly what we were looking for, and it is easy to add to our ‘SourceyBuild.sh’ file, which now looks as follows:
    # if you want to see what happens in more detail
    SOURCEY_VERBOSE=1
    # if you want to force sourcey to rebuild everything
    SOURCEY_REBUILD=0
    # create a copy of perl
    #buildPerl 5.22.1
    buildPerlModule PSGI
    buildPerlModule Plack
    buildPerlModule Plack::Runner
    buildPerlModule Plack::Request
    buildPerlModule Plack::Response
    buildPerlModule Plack::Builder
    buildPerlModule Plack::Middleware::Static
    buildPerlModule Data::Dumper
    buildPerlModule HTML::Template
    buildPerlModule JSON
    buildPerlModule Starlet
    buildPerlModule LWP::UserAgent
    buildPerlModule LWP::Protocol::https
    buildPerlModule HTML::Extract
2. Create a new template
Let’s start by creating a new template ‘urltone.tmpl’ using ‘tone.tmpl’ as a reference. This time there is no text area to fill in. We replace it with an input field that asks for a URL and change the wording accordingly. This is what the file looks like:
    <TMPL_INCLUDE name="header.tmpl">
    <script>$(function () { $('[data-toggle="tooltip"]').tooltip() })</script>
    <!-- Page Content -->
    <div class="container">
      <header class="jumbotron my-4">
        <img src="style/nxp2.png" class="float-right w-25">
        <h1 class="display-5">Tone assistant</h1>
      </header>
      <TMPL_IF name="alert">
        <TMPL_LOOP name="alert">
          <div class="container alert alert-<TMPL_VAR NAME='type'>" role="alert"><pre><TMPL_VAR name="msg"></pre></div>
        </TMPL_LOOP>
      </TMPL_IF>
      <div class="my-2">
        <TMPL_IF name="tone">
          <h4 class="display-7"><TMPL_VAR name="url"></h4>
          <div class="border border-primary rounded p-3">
            <TMPL_VAR name="tone" escape="none">
          </div>
        </TMPL_IF>
      </div>
      <div class="my-2">
        <form>
          <div class="form-group">
            <label for="textinput">URL to investigate</label>
            <input type="url" name="url" class="form-control" id="textinput">
          </div>
          <input class="btn btn-primary" type="submit" value="Analyze the tone of this URL">
        </form>
      </div>
    </div>
    <!-- /.container -->
    <TMPL_INCLUDE name="footer.tmpl">
To make the application accessible from all web pages, we add a link to the header file:
    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
      <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css" integrity="sha384-JcKb8q3iqJ61gNV9KGb8thSsNjpSL0n8PARn9HuZOnIxN0hoP+VmmDGMN5t9UJ0Z" crossorigin="anonymous">
      <script src="https://code.jquery.com/jquery-3.5.1.slim.min.js" integrity="sha384-DfXdz2htPH0lsSSs5nCTpuj/zy4C+OGpamoFVy38MVBnE+IbbVYUew+OrCXaRkfj" crossorigin="anonymous"></script>
      <script src="https://cdn.jsdelivr.net/npm/popper.js@1.16.1/dist/umd/popper.min.js" integrity="sha384-9/reFTGAW83EW2RDu2S0VKaIzap3H66lZH81PoYlFhbGU+6BZp6G7niu735Sk7lN" crossorigin="anonymous"></script>
      <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/js/bootstrap.min.js" integrity="sha384-B4gt1jrGC7Jh4AgTPSdUtOBvfO8shuf57BaghqFfPlYxofvL8/KUEfYiJOMMV+rV" crossorigin="anonymous"></script>
      <title>Tone assistant | <TMPL_VAR name="title"></title>
    </head>
    <body class="<TMPL_VAR name='section'> <TMPL_VAR name='title'> tmpl-<TMPL_VAR name='tmpl'>">
      <!-- Navigation -->
      <nav class="navbar navbar-expand-lg navbar-dark bg-dark fixed-top">
        <div class="container">
          <div class="collapse navbar-collapse" id="navbarResponsive">
            <ul class="navbar-nav ml-auto">
              <li class="nav-item active">
                <a class="nav-link" href="/">Hello world</a>
              </li>
              <li class="nav-item">
                <a class="nav-link" href="/tone">Tone assistant</a>
              </li>
              <li class="nav-item">
                <a class="nav-link" href="/urltone">URL tone state</a>
              </li>
            </ul>
          </div>
        </div>
      </nav>
This is where the earlier hard work pays off: because every page includes the same header template, the navigation changes uniformly on all pages.
3. Add the appropriate subroutine
Now that the URL of a web page is passed to our application, the coding is fairly straightforward.
To generate the text needed for the tone analysis, we use the HTML::Extract library. To complete the task, we reuse the ‘beautify’ routine that we previously used to “beautify” the content.
    sub urltone {
        my $env = shift;
        $env->{req} = Plack::Request->new($env);
        $env->{param}->{url} = $env->{req}->param('url');

        # Fetch the page and extract the plain text of its body
        my $p = HTML::Extract->new();
        my $doc = $p->gethtml($env->{param}->{url}, 'tagclass=body', 'text');

        if ($doc) {
            # Send the extracted text to the tone analysis API
            my $result = $ua->post($api_url,
                'Content-Type' => 'application/json',
                Content        => encode_json({text => $doc}));
            if ($result->is_success) {
                $env->{param}->{tone} = beautify($doc, decode_json $result->decoded_content);
            } else {
                Page::error($env, $result->status_line);
            }
        } else {
            Page::warning($env, "Nothing to analyze");
        }
        return Page::content($env);
    }
For completeness’ sake, it’s worth mentioning that to get the body content we specify the “tagclass” as “body”. We also request the content to be returned as plain text. If additional processing were needed, the default HTML format might be a better option.
4. Fine-tuning the app
As a proof of concept, the above code works fine. That said, when demonstrating it on URLs “on request”, some problems might occur.
Some websites are slow or unavailable. By default, the timeouts used when retrieving URLs can outlive the timeout of the proxy in front of the web app. This can result in a “backend cannot be reached” or similar message. To prevent this, we set an alarm that interrupts the retrieval after 10 seconds. If the data has not been fully loaded by then, an error message is generated.
Another problem is that most web pages contain characters outside the traditional ASCII set. Both decode_json and HTML::Extract return UTF-8 encoded data, whereas the HTML::Template module expects the content to be “Perl internal” encoded.
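The difference between UTF-8 bytes and Perl’s internal character strings is easy to see in isolation. The following is a minimal sketch that uses only the core Encode module; the sample string and the variable names are illustrative assumptions and not part of the application.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "café" as raw UTF-8 bytes: the "é" takes two bytes (0xC3 0xA9)
my $bytes = "caf\xC3\xA9";

# Decode the bytes into a Perl character string
my $chars = decode('UTF-8', $bytes);

# Encode the characters back into UTF-8 bytes before writing them out
my $out = encode('UTF-8', $chars);

printf "%d bytes, %d characters\n", length($bytes), length($chars);
# prints "5 bytes, 4 characters"
```

Mixing the two forms, for example encoding a string that is already encoded, is a typical cause of garbled accented characters in the rendered page.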
To complete the puzzle, the Extract library returns a “No candidates found” message if the tagclass body section does not contain any text. It also strips whitespace, which confuses the Tone Analyzer. To mark sentence endings more clearly, we insert an extra space after every ‘.’, ‘?’ and ‘!’ character.
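The spacing fix can be tried on its own. This is a small sketch; the helper name `space_sentences` and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Insert a space after every '.', '?' or '!' so that sentences that were
# glued together become separable again.
sub space_sentences {
    my ($text) = @_;
    $text =~ s/([.?!])/$1 /g;
    return $text;
}

print space_sentences("It works.Does it?Yes!"), "\n";
# prints "It works. Does it? Yes! " (note the trailing space)
```

The substitution is deliberately blunt: it also adds a space inside numbers such as 3.14 and inside abbreviations, which seems an acceptable trade-off for a proof of concept.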
This results in a more robust routine:
    sub urltone {
        my $env = shift;
        $env->{req} = Plack::Request->new($env);
        $env->{param}->{url} = $env->{req}->param('url');

        my $p = HTML::Extract->new();
        my $doc = '';

        # Retrieve the page, but give up after 10 seconds
        eval {
            local $SIG{ALRM} = sub { die 'Timed Out' };
            alarm 10;
            $doc = $p->gethtml($env->{param}->{url}, 'tagclass=body', 'text');
            alarm 0;
        };
        Page::error($env, "Timeout retrieving $env->{param}->{url}") if ($@);

        # Add a space after sentence-ending punctuation
        $doc =~ s/([\.\?!])/$1 /mg;

        if ($doc =~ /No candidates found/) {
            Page::error($env, "No useful content found on $env->{param}->{url}");
        } elsif ($doc) {
            my $result = $ua->post($api_url,
                'Content-Type' => 'application/json',
                Content        => encode_json({text => $doc}));
            if ($result->is_success) {
                $env->{param}->{tone} = beautify($doc, decode_json $result->decoded_content);
                utf8::encode $env->{param}->{tone};
            } else {
                Page::error($env, $result->status_line);
            }
        } else {
            Page::warning($env, "Nothing to analyze");
        }
        return Page::content($env);
    }
Notice that to capture the content of the URL, we start by creating an empty variable ‘$doc’.
To retrieve the content, we wrap the call to the ‘gethtml’ routine in an ‘eval’ block, which acts as a sandbox, and assign the content of the page to ‘$doc’ as soon as the routine returns.
To avoid blocking for more than 10 seconds, we install a localized ALRM signal handler that dies with ‘Timed Out’, which aborts the eval block and sets ‘$@’. The timer is armed with ‘alarm 10’. If the retrieval completes within 10 seconds, ‘alarm 0’ cancels the timer.
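The eval and alarm combination described above can be sketched as a reusable helper. This is a minimal illustration under stated assumptions: the name `with_timeout` is hypothetical, and a blocking `select` call stands in for a slow web site.

```perl
use strict;
use warnings;

# Run $code, giving up after $seconds. Returns (1, result) on success
# or (0, error message) on failure.
sub with_timeout {
    my ($seconds, $code) = @_;
    my $result;
    my $ok = eval {
        local $SIG{ALRM} = sub { die "Timed out\n" };
        alarm $seconds;          # arm the timer
        $result = $code->();
        alarm 0;                 # finished in time: disarm the timer
        1;
    };
    my $error = $@;
    alarm 0;                     # also disarm if $code died for another reason
    return $ok ? (1, $result) : (0, $error);
}

# Stand-in for a slow site: block for 3 seconds with a 1 second limit.
my ($ok, $value) = with_timeout(1, sub { select(undef, undef, undef, 3); 'done' });
print $ok ? "Got: $value\n" : "Failed: $value";
# prints "Failed: Timed out"
```

Returning a true value from the eval block and testing that value, rather than testing `$@` alone, is a common defensive habit. Keeping the error text makes it possible to tell a timeout apart from any other failure inside the block.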
5. Make the tone analysis URL accessible in the app
At this point, all we need to do is make our new URL accessible in the app.
    our $app = builder {
        enable 'Plack::Middleware::Static',
            path => qr{.*\.jpg|.*\.png}, root => '.';
        mount "/urltone" => \&urltone;
        mount "/tone"    => \&tone;
        mount "/"        => \&main;
    };
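To illustrate the idea behind mapping path prefixes to handlers, here is a dependency-free sketch. It is not how Plack::Builder is implemented; the `dispatch` routine, the placeholder handlers and the longest-prefix-first ordering are assumptions made for this example.

```perl
use strict;
use warnings;

# Placeholder handlers that return PSGI-style response triplets
my %routes = (
    '/urltone' => sub { [200, ['Content-Type' => 'text/plain'], ["urltone\n"]] },
    '/tone'    => sub { [200, ['Content-Type' => 'text/plain'], ["tone\n"]] },
    '/'        => sub { [200, ['Content-Type' => 'text/plain'], ["main\n"]] },
);

# Try the longest prefix first so that "/" only catches what is left over
sub dispatch {
    my ($env) = @_;
    my $path = $env->{PATH_INFO} // '/';
    for my $prefix (sort { length($b) <=> length($a) } keys %routes) {
        my $base = $prefix eq '/' ? '' : $prefix;
        next unless $path eq $prefix || index($path, "$base/") == 0;
        return $routes{$prefix}->($env);
    }
    return [404, ['Content-Type' => 'text/plain'], ["Not found\n"]];
}

print dispatch({ PATH_INFO => '/urltone' })->[2][0];   # prints "urltone"
print dispatch({ PATH_INFO => '/about' })->[2][0];     # prints "main"
```

Matching on whole path segments keeps a request for ‘/tonex’ from being routed to the ‘/tone’ handler.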
After pushing the app to the IBM Cloud, entering a URL and clicking the “Analyze the tone of this URL” button gives the tone analysis of the document on a sentence-by-sentence basis, together with the overarching score of the full document.
Conclusion
The internet is a place filled with exceptions and oddities. Even the simplest proof of concept can run into unexpected circumstances. Fortunately, with a flexible framework and the right environment, it is not strenuous to work around these limitations.
Preparation and testing are the key factors in turning frivolous code snippets into robust routines.
To assist in debugging, logging is an essential tool, both during development to gain in-depth insights and in production to trace incidents and do postmortem analysis.