5420 – Improving performances HTTP-wise

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 5420 - Improving performances HTTP-wise

Status:	RESOLVED FIXED

Alias:	None

Product:	mobileOK Basic checker
Classification:	Unclassified
Component:	Java Library
Version:	unspecified
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	---
Assignee:	Sean Owen

Reported:	2008-01-25 09:16 UTC by Dominique Hazael-Massieux
Modified:	2008-03-26 19:52 UTC

Description Dominique Hazael-Massieux 2008-01-25 09:16:34 UTC

I think one of the main performance bottlenecks in real-world usage today is the time needed to complete the HTTP requests on a page that has many links.

At a glance, I think the following improvements could be made:
 * using HEAD instead of GET when we're only interested in the HTTP headers - this could be achieved by adding a parameter to the HTTPResource constructor, I assume
 * parallelizing the requests through threading - it doesn't seem to be the case at this time
 * not requesting several times the same URI; from a simple log.info() testing, it looks like we actually do that currently

Comment 1 Sean Owen 2008-01-25 22:26:40 UTC

> At a glance, I think the following improvements could be made:
>  * using HEAD instead of GET when we're only interested in the HTTP headers -
> this could be achieved by adding a parameter to the HTTPResource constructor, I
> assume

This comes back to the question of whether mobileOK Basic should have
specified using GETs or HEADs for testing. As it is we specified that
the test is supposed to always use GET, since, that may be what
browsers do, and if HEAD fails, well, maybe it's just the server being
dumb.

That said, there's no reason we have to download the body of a GET
response if we don't want to, and I think this is not done where the
body is of no interest (need to check).
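
For illustration only, here is a minimal sketch of "GET without the body" in plain Java. It is not the checker's code: it uses java.net.HttpURLConnection rather than HTTPClient, and the names HeaderOnlyGet, HeaderResult and fetchHeaders are made up for this example.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;
import java.util.Map;

public class HeaderOnlyGet {

    /** Status code plus response headers; the body is never read. (Hypothetical type.) */
    public static final class HeaderResult {
        public final int status;
        public final Map<String, List<String>> headers;

        HeaderResult(int status, Map<String, List<String>> headers) {
            this.status = status;
            this.headers = headers;
        }
    }

    /** Issues a GET, reads the status line and headers, then drops the connection. */
    public static HeaderResult fetchHeaders(String uri) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(uri).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000); // arbitrary timeouts chosen for the sketch
        conn.setReadTimeout(5000);
        try {
            // getResponseCode() sends the request and parses status line and headers.
            int status = conn.getResponseCode();
            return new HeaderResult(status, conn.getHeaderFields());
        } finally {
            // Close the connection without ever opening the body stream.
            conn.disconnect();
        }
    }
}
```

The trade-off is that the connection is dropped instead of being reused for keep-alive, and the server may already have started sending the body, so the saving depends on how large the response is.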

>  * parallelizing the requests through threading - it doesn't seem to be the
> case at this time

The HTTPClient bit itself is prepared to handle many requests but yes
I don't think it is called in parallel in our code.

>  * not requesting several times the same URI; from a simple log.info() testing,
> it looks like we actually do that currently

Agree, I think we'll have to concoct some crude facade in front of the
HTTPClient and try to cache retrievals as you say. Shouldn't be too
bad.

Comment 2 Dominique Hazael-Massieux 2008-01-28 09:16:01 UTC

(In reply to comment #1)
> This comes back to the question of whether mobileOK Basic should have
> specified using GETs or HEADs for testing. As it is we specified that
> the test is supposed to always use GET, since, that may be what
> browsers do, and if HEAD fails, well, maybe it's just the server being
> dumb.

Actually, as I have noted before, mobileOK has a specific provision to allow for HEAD requests when link checking:
http://www.w3.org/TR/mobileOK-basic10-tests/#http_request

> That said, there's no reason we have to download the body of a GET
> response if we don't want to, and I think this is not done where the
> body is of no interest (need to check).

If we can do this, that should already help significantly.

(and thanks for the encouraging replies on the other stuff as well)

Comment 3 Sean Owen 2008-03-16 21:03:00 UTC

One point on not requesting the same URI several times -- the best solution seems to be a caching HTTP proxy. I looked into "Smart Cache", a Java-based solution, but found it was not set up to be embedded and had some, shall we say, structural problems.

The way to solve this is probably to run a proxy like Squid 3 on the checker machine, simple as that. Its default configuration looks like what we need, or something very close to it.

This leaves us with parallelizing requests, and maybe using HEAD.

Comment 4 Dominique Hazael-Massieux 2008-03-17 17:02:08 UTC

The problem with simply relying on an intermediary proxy-cache is that we only want to use a cache while checking a given page, not across two different checks.

Typically, if a user updates his page (or one of its components) after having checked it in the checker, he would be pretty puzzled if the report from the checker doesn't change because the checker used a cached copy instead of the live one.

In the BP checker, I kept a simple in-memory cache of fetched components, so that when I encountered the same URI twice, I would simply get it from memory rather than from the net. Given that we must be keeping a cache of the content somewhere to build the moki document, it would be "just" a matter of keeping track of the URIs. (he said, without having looked at how hard it would be to implement it in practice)
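
A rough sketch of such a per-check cache in Java, to make the idea concrete. This is not the BP checker's code; the class name PerCheckCache, the fetcher function and the networkFetches counter are assumptions made for the example. One instance would be created per check and thrown away afterwards, so nothing is shared between two checks.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** In-memory cache meant to live only for the duration of one check. (Hypothetical class.) */
public class PerCheckCache {
    private final Map<URI, byte[]> fetched = new HashMap<>();
    private final Function<URI, byte[]> fetcher; // does the real retrieval; must not return null
    private int networkFetches = 0;

    public PerCheckCache(Function<URI, byte[]> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the content for the URI, calling the fetcher only on first sight of it. */
    public byte[] get(URI uri) {
        URI key = uri.normalize(); // collapses "." and ".." path segments; fragments are kept
        byte[] body = fetched.get(key);
        if (body == null) {
            body = fetcher.apply(key);
            networkFetches++;
            fetched.put(key, body);
        }
        return body;
    }

    /** Number of times the fetcher was actually invoked. */
    public int networkFetches() {
        return networkFetches;
    }
}
```

The sketch is not thread-safe; if retrievals ran in parallel, the map would need to be a concurrent one or access would need to be synchronized.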

Comment 5 Sean Owen 2008-03-26 19:52:11 UTC

Tentatively calling this fixed on the grounds that:

1) On HEAD vs. GET, we don't download the body of a GET response when we don't need it, which achieves the goal of using HEAD. The doc does say, to my surprise, that HEAD may be used in one particular test. I remain concerned that there are servers that quite happily respond to GET but not HEAD, so using HEAD would make the test about whether HEAD works.

2) Images, objects, external links are retrieved in parallel now.

3) We don't retrieve the very same URI multiple times, at least not according to the code now. URIs are de-duped using a Set before retrieving them so this should avoid any need for caching, I think.
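
For readers who want a picture of points 2 and 3 together, here is a minimal Java sketch of de-duplicating URIs with a Set and then retrieving them in parallel. It is an illustration under assumptions, not the checker's actual code: ParallelFetcher, fetchAll, the fetcher function and the thread count are all invented for the example.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelFetcher {

    /** Fetches each distinct URI once, using a fixed pool of worker threads. */
    public static <T> Map<URI, T> fetchAll(List<URI> uris, Function<URI, T> fetcher, int threads)
            throws InterruptedException, ExecutionException {
        // De-dupe first; LinkedHashSet keeps the order of first appearance.
        Set<URI> unique = new LinkedHashSet<>(uris);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit every retrieval up front so they can run concurrently.
            Map<URI, Future<T>> pending = new LinkedHashMap<>();
            for (URI uri : unique) {
                Callable<T> task = () -> fetcher.apply(uri);
                pending.put(uri, pool.submit(task));
            }
            // Collect results; get() blocks until that retrieval has finished.
            Map<URI, T> results = new LinkedHashMap<>();
            for (Map.Entry<URI, Future<T>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each distinct URI is submitted exactly once, no cache is needed within a single call; a failure in any one retrieval surfaces as an ExecutionException from get().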