ToolKit for Large-SCAle studies of Web documents

The project aims to provide an efficient toolkit for evaluating web sites at large scale.

The evaluation of Web documents is described as a set of co-operating services. Some services check whether the documents meet given requirements, such as checkpoints from the Web Content Accessibility Guidelines (WCAG). Other services compute statistics over the results, broken down by parameters such as page rank, site, organization, …
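
One such checker could, for instance, look for <img> elements without an alt attribute, one of the few WCAG checkpoints that is fully automatable. The Python sketch below is purely illustrative (the function name, check identifier, and result format are invented, not part of the project's code):

    # Minimal sketch of a single checker service: flag <img> tags lacking alt.
    from html.parser import HTMLParser

    class MissingAltChecker(HTMLParser):
        """Collects <img> tags that lack an alt attribute."""
        def __init__(self):
            super().__init__()
            self.failures = []

        def handle_starttag(self, tag, attrs):
            if tag == "img" and "alt" not in dict(attrs):
                self.failures.append({"check": "img-alt", "line": self.getpos()[0]})

    def check_document(html: str) -> list:
        """Run the (single) automatable check and return the list of failures."""
        checker = MissingAltChecker()
        checker.feed(html)
        return checker.failures

    # One failing image, one passing image:
    print(check_document('<img src="a.png"><img src="b.png" alt="logo">'))
    # -> [{'check': 'img-alt', 'line': 1}]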

The services are composed into a data-driven workflow described in Gwendia. The workflow is executed by MOTEUR and exploits a distributed grid computing infrastructure to run multiple service invocations concurrently.
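
MOTEUR and the grid middleware handle the actual parallel execution; purely to illustrate the underlying data-parallel pattern (every input page flows independently through the same checker service), here is a local Python stand-in using a thread pool. It reuses check_document from the sketch above; the URL list and pool size are arbitrary placeholders:

    # Local stand-in for grid-parallel evaluation: each page is fetched and
    # checked independently, so invocations can run concurrently.
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    def fetch(url: str) -> str:
        # Naive fetch; a real service would handle encodings, redirects,
        # errors, and politeness (robots.txt, rate limiting).
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def evaluate(url: str) -> tuple:
        # One "service invocation": fetch a page, then run the checker on it.
        return url, check_document(fetch(url))

    urls = ["http://example.org/", "http://example.com/"]  # placeholder inputs
    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, failures in pool.map(evaluate, urls):
            print(url, "->", len(failures), "failure(s)")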

To learn more about web accessibility: http://www.w3.org/WAI/eval/preliminary.html

Some references (to be completed):

  • Some sketches of discussions with W3C members
    • “Yes, it is important to note that these large-scale studies only survey very few checks (the ones that are automatable). There are questions about the actual relation between such studies and the real situation.”
    • “Regarding usefulness, of course, I think it would be nice to have some numbers, especially if the study integrates some kind of page rank or usefulness/popularity measure, as I imagine tons of inaccessible pages are of no use to anyone, let alone people with disabilities.”
    • Generally it is important to have some form of indication of the level of compliance (politicians love numbers, and the EC likes to compare the Member States to promote competition). However, one of the main issues is that many of these studies give “zero scores” to websites with any sort of “failure”. For instance, if an entire website has one missing alt attribute, it is deemed “not compliant”. As a result, many of the studies show only ~3-5% compliance despite the many efforts worldwide. This can be daunting and counter-productive rather than motivating.

Students have already worked on this project and produced a workflow that evaluates some features on a set of websites. The project website is here: http://code.google.com/p/tklascaw

This work aims to operationalize the previous study, in particular regarding the following points:

  1. Use distributed grid infrastructures to analyze web sites efficiently, possibly several sites concurrently
  2. Improve control over the process
  3. Add new statistics to better evaluate accessibility (a scoring sketch follows this list)
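
As an illustration of point 3, and following the W3C remark above about “zero scores”, one possible graded statistic weights each page's pass ratio by a popularity measure such as page rank, so that a single missing alt attribute no longer collapses a whole site to “not compliant”. The field names and weighting scheme below are assumptions for this sketch, not a metric defined by the project:

    # Hypothetical rank-weighted compliance score in [0, 1].
    def site_score(pages: list) -> float:
        weighted = sum(p["rank"] * p["passed"] / p["total"] for p in pages)
        return weighted / sum(p["rank"] for p in pages)

    pages = [
        {"passed": 19, "total": 20, "rank": 0.9},  # popular page, one failure
        {"passed": 10, "total": 20, "rank": 0.1},  # obscure page, many failures
    ]
    print(site_score(pages))  # ~0.905 rather than a flat "non-compliant"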

The software will be structured as multiple components that can be integrated into workflows, making it easy to redefine test procedures on demand at the workflow level.
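
A minimal sketch of what such a component contract could look like: if every component consumes and produces plain JSON-serializable records, the workflow layer (Gwendia/MOTEUR in the real system) can reorder or swap stages without code changes. The stage names and fields here are invented for illustration:

    import json

    def run_pipeline(components, payload):
        # Apply components in order; any sequence of compatible stages works.
        for component in components:
            payload = component(payload)
        return payload

    # Two toy stages sharing the same dict-in/dict-out contract.
    def annotate_rank(doc):   # e.g. attach popularity data to a document record
        return {**doc, "rank": 0.5}

    def summarize(doc):       # e.g. keep only the fields a report needs
        return {"url": doc["url"], "rank": doc["rank"]}

    result = run_pipeline([annotate_rank, summarize], {"url": "http://example.org/"})
    print(json.dumps(result))  # {"url": "http://example.org/", "rank": 0.5}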

The main project steps are:

  1. Analyze existing material:
    • Technical items:
      • construction of data-driven workflows such as those used for scientific computing
      • construction of Web Services
      • assembling existing components
      • deployment on a compute grid and related constraints
    • Normative items:
      • W3C standards and the state of the art in web site accessibility
  2. Deploy a simple web site evaluation workflow
  3. Specify properties and priorities to design an optimal workflow
  4. Operationalization and experimental campaign