2 Aug status

12 July status

  • FlashLight partially works (no mouse/keyboard input)
  • NoVNC works (low performance, occasional pixel corruption on screen)
  • QEMU/KVM as backend (works)
  • VirtualBox, VMware, Xen (ToDo)

Crowdsourcing for GUI testing

Research angle:

Facilitating weekly automated human software testing for $40 per week


  • Automated
  • Robust
  • Cheap

Human side:

  • Can software testing exploit crowdsourcing?
  • 20 workers test the software for 30 minutes (0.5 h x $4/hour x 20 workers = $40)
  • Worker job completion
  • Worker attention span
  • Do workers read on-screen instructions?
  • Weekly returning MTurkers?
  • How much to pay them?
  • Reputation of weekly returning MTurkers (Martha)
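The $40/week budget above is a simple product of wage, task length, and worker count; a quick sanity-check calculation (variable names are illustrative, not existing code):

```python
# Weekly cost estimate for one crowdsourced test round (figures from the notes above).
HOURLY_WAGE = 4.00   # dollars per worker-hour
TASK_HOURS = 0.5     # 30 minutes per worker
NUM_WORKERS = 20     # workers per weekly round

weekly_cost = HOURLY_WAGE * TASK_HOURS * NUM_WORKERS
print(weekly_cost)  # → 40.0
```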


  • MTurkers use any browser
  • MTurkers probably do not have HTML5 support
  • Mouse lag (there is always latency)
  • VNC, phpvirtualbox, Flash, JavaScript
  • Fraud detection
    • check whether the task was actually completed (downloaded file, etc.)
    • multiple workers, consistency checks
  • Crash capture and logging
  • Multi-platform GUI testing (a lot of engineering)
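The multiple-workers consistency check for fraud detection could be sketched as a majority vote over reported outcomes; a minimal illustration (function and report format are assumptions, not an existing API):

```python
from collections import Counter

def flag_inconsistent(reports):
    """Flag workers whose reported outcome disagrees with the majority.

    reports: dict mapping worker_id -> reported outcome (e.g. the hash of
    the downloaded file, or a success/failure string).
    Returns the set of worker_ids whose report deviates from the majority.
    """
    majority, _ = Counter(reports.values()).most_common(1)[0]
    return {worker for worker, outcome in reports.items() if outcome != majority}

reports = {"w1": "ok", "w2": "ok", "w3": "crash", "w4": "ok"}
print(flag_inconsistent(reports))  # → {'w3'}
```

Deviating workers are not necessarily fraudulent (they may have hit a real bug), so flagged reports would still need the logs/screen capture for review.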


  • Try to get a first MTurk test by 15 June
  • Test out various browser solutions

GOAL: A webpage with the results of 5-10 different tests which are automagically MTurked weekly. Each test is green when all testers reported success, yellow if one MTurker encountered a problem, and red in all other cases. Every test is clickable and shows the complete log files of every MTurker's work, plus a complete screen-capture video of their screen/application activity during the test. Tests:

  • click on a network buzz keyword and start a download
  • keyword search without suggest and start a download
  • keyword search with suggest after typing "2011 " and start a download
  • pause and resume a download
  • subscribe to channels
  • conduct tests with both an empty megacache and a 50k-item megacache

Additional tasks:

  • ToDo: disable the family filter; prevent users from conducting both the A and B test by making it a single HIT on MTurk
  • Find success rate for various formulations
    • "try to find out how to add something to your channel"
    • "Locate the channel button and add something to Your Channel"
    • "3rd formulation"
    • A) try to understand the channel concept in Tribler B) discover where "your channel" is located C) add something to your channel
  • Find success rate for various formulations
    • "try to download a single file from a swarm"
    • "search for 'blue suitcase', go to the files tab, select the file 'vodo.nfo', click the 'download selected only' button"
  • A/B testing. Create two variants of Tribler and test success rate/task completion time.
    • Search with and without the bundling feature
    • A: bundling turned off/disabled
    • B: bundling turned on
    • Training search queries: "blue suitcase" (simple: one result), "TED Bill Gates", "big buck bunny"
    • A/B search tasks: "Ubuntu 11.04", "Pioneer One" episode 2, "Sintel", "the yes men fix the world"
    • Measure task completion time + its evolution over queries (from init until the start of the download!), variance within the test population, 95% significance?
    • Conclude: inconclusive whether this feature is good or not, but we have demonstrated that MTurk can be used for this sort of task
    • NULL hypothesis: reject the hypothesis that it does not work. Benchmark against the classical method.
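One way to test the A/B completion-time difference at the 95% level is Welch's t-test on the two samples; a stdlib-only sketch with made-up completion times (the data and threshold are assumptions):

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom
    for two independent samples with unequal variances."""
    va, vb = variance(a), variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical completion times in seconds (A: bundling off, B: bundling on)
a = [95, 110, 102, 120, 98]
b = [80, 85, 92, 78, 88]
t, df = welch_t(a, b)
print(round(t, 2), round(df, 1))  # → 3.93 6.3
```

The resulting t and df would be compared against the t-distribution's 95% critical value; with the small per-query samples here, pooling over queries or more workers per condition may be needed for significance.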

GUI usability testing

HYPOTHESIS: Both experienced and novice users of P2P technology don't read anything in the GUI

Tools: task completion time, replay of the captured user mouse clicks + moves + GUI.

Danger1: task-completion-time noise: workers may be doing other tasks in the background; discard measurements with a non-moving mouse.
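The discard rule for noisy measurements could be sketched as an idle-gap check over the captured mouse events (the event format and 60-second threshold are assumptions):

```python
def is_valid_measurement(mouse_events, task_duration, max_idle=60.0):
    """Discard a completion-time measurement when the mouse sat still too long.

    mouse_events: timestamps (seconds since task start) at which any mouse
    movement or click was observed.
    task_duration: measured task completion time in seconds.
    Returns False when any idle gap exceeds max_idle seconds.
    """
    times = [0.0] + sorted(mouse_events) + [task_duration]
    gaps = (later - earlier for earlier, later in zip(times, times[1:]))
    return max(gaps) <= max_idle

print(is_valid_measurement([5, 20, 41, 70], 90))  # → True
print(is_valid_measurement([5, 20], 200))         # → False (180 s idle gap)
```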

Test0: do they understand the search results page?
Test1: do they click/understand the frontpage tags?
Test2: do they spot the second + third columns for bundling results?
Test3: do they notice with bundling that the first hit represents a sample? (they don't read the "more")