5 Methods

5.1 Selection of methods

5.1.1 Introduction

As stated earlier, the aim of this work is to explore the role of metaphor in human-computer interaction and, hopefully, to provide assistance to interface designers in their choice or use of metaphors. According to de Saussure and Barthes' theories, the signification of interface metaphors to the user will exist on many levels. Other research and my own preliminary studies also suggest that interface metaphors play many different roles in the interaction of the user and the system. Metaphor is used as a means of introducing novel concepts to the user but also brings with it conceptual baggage. It can turn users' attention towards different aspects of the system or towards their own purpose in using the system. The metaphor vehicle also introduces concepts which themselves carry many levels of signification, above and beyond the metaphor's immediate support for the user.

Each of these aspects of interface metaphor might provide a valuable area for further investigation. Together they open up a wide range of potential methods drawn from many different fields of study, including linguistics, semiotics, psychology and sociology.

5.1.2 Potential methods

Some of the possible research issues and potential methods for their investigation are summarised in the following table. The table is by no means exhaustive but lists some of the principal research methods which might be worth consideration:

Table 5.1: Potential research methods.

Research issue | Field of study | Research methods
Metaphor as human-computer interaction. | HCI, software ergonomics. | Experimental psychology.
The role of metaphor in the user's mental model of the system. | Cognitive psychology. | Experimental psychology, computer modelling.
The role of metaphor in the user's motivation and work effectiveness. | Management science. | Surveys, interviews, case studies, economic analysis.
The cultural role of interface metaphors. | Anthropology. | Ethnology, case studies.
Interface metaphors as a social artefact. | Sociology, social psychology. | Observation, interviews, action research.
The mechanism of metaphor. | Linguistics, rhetoric. | Rhetorical analysis, grammatical analysis.
The interface metaphor as a sign. | Semiotics. | Semiotic analysis or de-construction.

Chapters two and three drew on literature from a number of fields which offer possible approaches for this study. As the quotation from Whittock (1990) in Chapter 3.4.1 pointed out, the most obvious approaches to the study of rhetoric are either those based on cognitive psychology or those of 'rhetoric and strategies of communication'. I have carried out limited investigations in both these areas, as described in the previous chapter. The experiment based on cognitive models indicates that metaphors can operate at different layers of signification, leading to very different relationships of users to the system and their purpose in using it. However, it would be difficult to extend this experiment to a more rigorous comparison of metaphor classes without a very large number of implementations of many different metaphors, which would be well beyond the scope of this thesis. This does not rule out other methods of examining the phenomenon of multiple signification more deeply.

Rhetorical analysis is normally used to examine the content of a text and categorise the various tropes and schemes used. Although this was useful in demonstrating the prevalence of metaphor and metonymy in the MS-DOS command language, it does not offer any assistance to the designer nor provide any insight into the human aspects of human-computer interaction. Another method based on a 'strategy of communication' is semiotic analysis, although this is far more subjective, de-constructing a text to draw out its full signification to the reader. It could be questioned whether it is appropriate to use such a subjective method as part of a thesis in the field of HCI.

Burrell and Morgan (1979, p. 6-7) grade research methodologies in a continuum ranging from the nomothetic to the ideographic. Nomothetic methodologies are deductive and objective, characterised by systematic protocol and technique; ideographic methodologies are inductive and subjective, characterised by 'getting inside' situations. According to this categorisation, semiotic analysis is overwhelmingly ideographic. The associated subjectivity is not a fault of semiotics but a central feature, in that the semiotic viewpoint sees all meaning or signification as subjective, as is made clear by the semiotic view of the relationship between the signifier and the signified which takes place entirely within the head of the reader or observer, who 'reads meaning into the text'. The analyst thus has to attempt to 'get inside' the head of a potential user.

Robson (1993, p.18-19) makes a similar distinction between 'scientific' and 'interpretive' approaches. He points out that the former is often described as 'hypothesis testing' and the latter as 'hypothesis generating.' However, he goes on to say that, "many of the differences between the two traditions are in the minds of philosophers and theorists, rather than in the practices of researchers." (Robson 1993, p.20). He quotes Bryman in support of this view:

The suggestion that quantitative research is associated with the testing of theories, whilst qualitative research is associated with the generation of theories, can... be viewed as a convention that has little to do with either the practices of many researchers within the two traditions or the potential of the methods of data collection themselves. (Bryman 1988, p.172)

In the preface to his book, Robson (1993) admits that, as an experimental psychologist, he "started with a virtually unquestioned assumption that rigorous and worthwhile enquiry entailed a laboratory, and the statistical analysis of quantitative data obtained from carefully controlled experiments." However, his interest in real world research demanded approaches which could "say something sensible about such complex, messy, poorly controlled 'field' settings." In his case suitable, though more subjective, methods came from the sociologists and social psychologists he worked with. Semiotic analysis of a user interface, however, does not describe how that interface is viewed by its users in the real world; rather, it looks at all the possible ways in which it might be viewed by users. Semiotic analysis could be used to deconstruct the language of users' interaction with their systems, but this would be a much more extensive task than analysing the interface, and it is questionable whether it would yield results as useful as those of ethnographic approaches, which have been developed explicitly to study such real world interaction.

A fourth approach was proposed in Chapter 3 - that of the 'What for?' interview technique. Such a use of simple open-ended questions is known as 'probing'. It offers a simple technique which designers could employ with their own users and, though it is related to the semiotic method, it is more formalised, leaving less room for the designer's personal bias and obtaining data purely from the user.

5.1.3 Probing

Probing was developed as a technique for use in a particular form of non-directive interview - the focused interview (Robson 1993, p.240-41; Zeisel 1984, p.140). Rubin and Rubin identify three reasons for using probes:

Probes encourage the speaker to keep elaborating. Second, probes ask the interviewee to finish up the particular answer currently being given... The third function of probes is to indicate that the interviewer is paying attention. (Rubin 1995, p.148).

They then identify five types of 'housekeeping' probes: elaboration, continuation, clarification, attention and completion.

They ensure that you are getting a reasonably accurate and understandable answer while encouraging the interviewee to keep talking. But probing does more than keep the conversation going, it helps get the depth and dependability you need. (Rubin 1995, p.150).

Rubin and Rubin also describe steering (p. 208), sequence, experience, evidence and slant probes (p.208-10). These are not relevant to this experiment because, as their names indicate, they are used by the interviewer to steer the interview in a particular direction whereas 'What for?' probes are intentionally non-directive.

Zeisel (1984, p.141-56) provides a more detailed analysis of the types of probe an interviewer might use, categorising them as follows:-

Addition probes to promote flow - used to get respondents to express themselves more fully, e.g. encouragement and body language - "I see" or a nod of the head.

Reflecting probes to achieve non-direction - echoing the respondent or responding to a question by repeating it back.

Transition probes to extend range - moving on to the next issue or expanding an issue that was mentioned but skipped - "that reminds me of something you were saying earlier" or "that raises the general issue of..."

Situation probes to encourage specificity, e.g. pointing to a map or a picture to establish precisely what the respondent is talking about.

Emotion probes to increase depth, e.g. "what do you feel about this?"

Personal probes to tie in context, e.g. "is there anything particular about you that makes you feel strongly about this subject?" "does that relate to some previous experience you've had?"

The closest of these categories to the type of question used in the 'What for?' interviews is that of reflecting probes. Certainly, the 'What for?' probe supports non-direction. However, in interviews quoted at length, Zeisel includes another category which is even closer - the general probe. For example:

Respondent: I am afraid to live in that area

Interviewer: What are you afraid of? (p.153)

Respondent: I find this office extremely inefficient and wasteful.

Interviewer: In what way? (p.155)

These examples represent the closest category of probe to the type of question I am advocating. Zeisel's book is specifically about the use of interviews to gather respondents' opinions about environmental situations - where they live, work or visit - in order to inform design and planning decisions. This is comparable to using such techniques to get information about an interface the person works with to assist in the design of that interface. The difference between the 'What for?' technique and Zeisel's focused-interviews lies in the role of the interviewer. In his case there is a specific focus towards the design issues, whereas the 'What for?' technique simply attempts to uncover as many layers of meaning as possible; it is for the designer to consider whether these are relevant afterwards.

5.1.4 Research validity

Gill and Johnson (1997, p.128-29) offer a number of criteria by which the validity of a chosen research method might be assessed:

Internal validity. The degree to which the researcher can be sure that the 'cause' is what actually produces the effect.

External validity. The extent to which the research can be generalised. This can be subdivided into the following:

Population validity. The validity of generalising from the research sample to the population in general.

Ecological validity. The validity of generalising from the social context of the research to other contexts and settings.

Reliability. The consistency of the results and the degree to which another researcher would be able to replicate the original research.

As consideration of cause and effect is not relevant to exploratory research (see Section 5.2.1), internal validity will not be considered at this point. However, it is also necessary to consider the practicality of the methods and whether they could yield 'useful' results to help interface designers, giving the following table:

Table 5.2: Validity of research methods.


Method | Population validity | Ecological validity | Reliability | Practicality | Usefulness
Rhetorical analysis | N/A | Low | Medium | High | Low
Semiotic analysis | Low | Low | Very low | High | Medium
Experimental psychology* | (Medium) | Low | (Medium) | (Low) | Medium
Probing | Medium | Medium | Medium | High | High

(*In the case of experimental psychology, there is an inverse relationship between the validity criteria shown in brackets and the practicality of the experiment. As explained above, comparing metaphor categories with moderate population validity and reliability would require a great many experiments, giving a low level of practicality.)

Rhetorical analysis can be excluded as not giving very useful results, while semiotic analysis could give useful results but must be rejected as its external validity is so poor. By contrast, the potential of probing with the 'What for?' technique could be demonstrated by a relatively small number of interviews, with the potential to yield highly useful results of reasonable validity. This approach therefore formed the basis for the research design described below.
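The elimination argument above can be sketched as a simple filter over the ratings in Table 5.2. This is only an illustration of the reasoning, not a procedure used in the thesis: the numeric ordering of the ratings, the three rejection rules and the names `ORDER`, `RATINGS` and `rejection_reasons` are all assumptions introduced here. In particular, "poor external validity" is read as meaning that no available external validity rating is better than Low.

```python
# Ratings transcribed from Table 5.2. Bracketed entries are stored without
# their brackets, and "N/A" is stored as None.
ORDER = {"Very low": 0, "Low": 1, "Medium": 2, "High": 3}  # assumed ordering

RATINGS = {
    "Rhetorical analysis": {"population": None, "ecological": "Low",
                            "reliability": "Medium", "practicality": "High",
                            "usefulness": "Low"},
    "Semiotic analysis": {"population": "Low", "ecological": "Low",
                          "reliability": "Very low", "practicality": "High",
                          "usefulness": "Medium"},
    "Experimental psychology": {"population": "Medium", "ecological": "Low",
                                "reliability": "Medium", "practicality": "Low",
                                "usefulness": "Medium"},
    "Probing": {"population": "Medium", "ecological": "Medium",
                "reliability": "Medium", "practicality": "High",
                "usefulness": "High"},
}


def rejection_reasons(rating):
    """Return the (assumed) reasons for rejecting a method; empty if none."""
    reasons = []
    if ORDER[rating["usefulness"]] <= ORDER["Low"]:
        reasons.append("low usefulness")
    # External validity = population validity + ecological validity.
    external = [ORDER[rating[key]] for key in ("population", "ecological")
                if rating[key] is not None]
    if external and max(external) <= ORDER["Low"]:
        reasons.append("poor external validity")
    if ORDER[rating["practicality"]] <= ORDER["Low"]:
        reasons.append("low practicality")
    return reasons


SHORTLIST = [method for method, rating in RATINGS.items()
             if not rejection_reasons(rating)]
print(SHORTLIST)
```

Under these assumed rules only probing survives, which matches the conclusion reached in the text. Different thresholds would of course give a different shortlist.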

5.2 Research design

5.2.1 The purpose of the enquiry

The purpose of this enquiry is not to uncover useful information about specific metaphors or categories of metaphor, but to find out whether 'What for?' interviews offer a potentially useful technique for interface designers to use. Robson (1993, p.42) distinguishes between three principal purposes of enquiry: exploratory, descriptive and explanatory. Of these, investigation of the 'What for?' technique falls into the exploratory category which he characterises as follows:

Robson points out that it is commonly suggested that there is a hierarchical relationship between the research strategy and the purpose of enquiry:

While accepting this as a general rule, Robson points out that it is not absolute - for example, case studies have been used for all three purposes. In considering the 'What for?' technique, some form of case study does indeed appear to be most appropriate. The technique is intended to uncover the higher levels of signification which depend on the context - what the user is using the interface for. This context would change radically in the laboratory where a user would be using the interface to help in an experiment and the signification would be radically different. To test the technique it is therefore necessary to use it in the real world, as close to the conditions in which a designer might use it as possible.

Conventionally, both case studies and surveys examine what is happening in the real world. This is obviously not possible in this case, as the technique is not yet being used. The research must therefore take the form of one or more case studies in which the technique is taken into the real world and applied within it. As such it forms what Robson (1993, p.41) classes as a hybrid strategy, combining aspects of quasi-experiments and case studies.

5.2.2 Interview structure

Uncovering the signification of the interface to the user means any interview must be user-directed - the interviewer must not ask any leading questions. The nature of recursive signification introduced in Chapter 3 implies that the interviewing technique should also be recursive.

The unstructured semiotic analysis of part of the Macintosh user interface, discussed in Chapter 4, section 2, looked at how a user might 'read' an interface. The model above represents a clearer way of encouraging the user to articulate his or her signification:

Technically, there is a difference between these two questions in some circumstances. 'What?' implies an object, action or concept. In the context of the sign then, if a signifier exists (which it must do to ask the question), the object will be the signifier. 'Why?' implies a mechanism, in this case the signification. In practice, the main constraint is the nature of the English language which favours one construction in some cases but not others. For example, it is more meaningful to ask, "What is a spade for?" rather than, "Why is a spade?"; "Why is the sky blue?" rather than, "What is the sky blue for?"

However, this is not an absolute rule in our everyday use of English. For example, we would usually ask, "Why did the chicken cross the road?" rather than, "What did the chicken cross the road for?" In many circumstances, the two questions are interchangeable: "Why did you do that?" is directly equivalent to "What did you do that for?"

These principles form the basis of the interview technique. The first preference is to ask, "What for?" rather than "Why?" There are two reasons for this. A simple one is that the interview must begin with "What for?" because the interviewer does not know the user's initial signified at that point. For example, the interviewer might ask, "What is that for?" but could not ask "Why is that?" At a later stage the interviewer could ask "What do you use that for?" rather than "Why do you use that?" but only once the interviewee has made it clear that he or she does use the interface element referred to.

The second reason for preferring "What for?" is that the user is more likely to be aware of the signified than the signification and to answer questions in those terms. Sometimes the 'What for?' question is difficult to phrase and 'Why?' is easier and carries the same meaning in normal conversation. Asked what an interface element is used for, the user might answer, "to send reports to headquarters." The question "Why is that used?" would receive the same reply. Technically it could be answered, "because it is labelled 'reports'" as this is its immediate signification, but this type of response did not occur in the pilot studies. Whichever form of question was asked, users would answer with what they send reports to headquarters for.

One exception to this is where an interviewee replies that there are many answers to the question. The interviewer can then pause to see whether the interviewee follows up with examples or probe for them with the simple question 'such as?'. Although this may seem to limit the user, it is not proposed that the 'What for?' technique provides information on all possible lines of signification that a user might take. If the interviewee considers that the other significations are important, he or she can return to them later in the questioning.

5.2.3 Pilot study

The main aim of the pilot was to check that the technique would be likely to work and to gain skill in interviewing. Both users and designers were interviewed as it seemed that the technique might raise some interesting contrasts between the signification for the two groups. All of the designers and some of the users were friends or relatives and thus well-known to me and unsuitable as subjects for the main experiment, but adequate to check out the technique and decide whether this direction was worth pursuing. Systems were studied across a range of applications and user environments to see whether this appeared to affect the applicability of the technique.

Interfaces 1 and 2 were both developed within the IT support team for a Local Education Authority (LEA). One interface considered was a statistical reporting system developed in Excel and running on a PC. It is used to account for the placement of special teachers to support children who do not have English as a first language and to report back to the Home Office. The second system runs on an IBM AS/400 and supports a form-based interface used to administer the payment of student grants.

Two interviews were carried out with designers of a manufacturing system. Unfortunately, the company was taken into receivership shortly after the interviews and it was not possible to gain access to users. The system provides feedback on scheduling for advanced manufacturing.

The fifth interface was a Web page set up by a fellow researcher in Brunel whom I interviewed as the developer of the interface. I also interviewed a research manager from a different research centre who had used the web site's diary facility to set up a meeting. At the time, neither researcher was aware of the details of this thesis.

For the pilot study, I carried out the content analysis myself. The bias inherent in this, together with the small number of interviewees, means that the results are not suitable for extensive analysis. The numbers of separate layers of signification uncovered in each interview are shown in the table below:

Table 5.3: Levels of signification in pilot study.

Number of layers of signification
Interface (sector) | Designer | User
1 (education) | 12 | 13
2 (education) | 9 | 9
3 (manufacturing) | 10 | N/A
4 (manufacturing) | 7 | N/A
5 (research) | 12 | 12

As the table shows, similar numbers of layers were uncovered in every interview, across both designers and users and across usage sectors. This may well have been because, in all cases, the users were personally known to the designers and, in all but the web interface, the interfaces formed part of bespoke systems designed for those specific users.
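The similarity between the two columns of Table 5.3 can be made concrete with a purely descriptive summary. The sketch below only restates the figures in the table; given the small number of interviews and the bias already noted, it is not intended as statistical analysis. The names `PILOT` and `summarise` are introduced here for illustration.

```python
# (interface, sector, designer layers, user layers); None marks "N/A" in Table 5.3.
PILOT = [
    (1, "education", 12, 13),
    (2, "education", 9, 9),
    (3, "manufacturing", 10, None),
    (4, "manufacturing", 7, None),
    (5, "research", 12, 12),
]


def summarise(counts):
    """Return the count, range and mean of the layer counts, skipping N/A."""
    values = [c for c in counts if c is not None]
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": round(sum(values) / len(values), 1),
    }


designers = summarise(row[2] for row in PILOT)
users = summarise(row[3] for row in PILOT)
print("Designers:", designers)
print("Users:", users)
```

On the table's figures, designers range from 7 to 12 layers with a mean of 10.0, and users from 9 to 13 with a mean of 11.3.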

A number of interesting features were observed when examining the transcripts. Some interviewees started looping, going back to a previous answer and repeating the explanations given. Where this loop was obviously going to be repeated, I finished the interview. In one case, however, the interviewee backtracked and provided a new set of significations. Successive layers of signification led him to saying that he wanted a good job. When asked what for, he said it was for the money but then backtracked and gave the explanation that he was actually looking for personal fulfilment in his work.

Apart from the branches and loops, most responses started at what appeared to be the simplest levels of signification, such as 'it produces a report', progressing upward to higher motives such as 'education is a good thing'. The only exception was the researcher who had used the Web page. Possibly because he was used to looking at why people use systems and how they are structured, he began his responses by saying, "it's a link to another page in Netscape". He then attempted to give an explanation of people's underlying motives for using the Web in general before saying, "That's probably reached the end." He then added, "I've taken your questions in a general sense instead of looking at that particular page but then after all you did point me to that word 'diary'." He then began at the 'bottom' level, explaining why he had used the diary facility, until he reached the level of signification at which he had originally started.

The following quotations from the interviews show the highest levels of signification reached by each of the interviewees. Judgement of which level was the 'highest' was a purely subjective choice on my part:

Table 5.4: Highest levels of signification in pilot study.

Interface Designer's signification User's signification
Education sector
1 Because it is a good idea to educate kids. There are political reasons. Various political issues, concerned with under-achievement of the children
2 It is a good thing that people go to college to study. [The government] have to encourage people to stay in education.
Manufacturing sector
3 Quality of life in terms of earning salary. N/A
4 It is a bad idea to have increased costs or late orders. N/A
Research sector
5 Using my mind and making the best use of my ability. To make me happy. Exploration or interest in the back of my head

There are close similarities between the responses given by the designer and the user of each interface and between people working within the same sector. Again, this could well be because the people concerned work with one another and share a common viewpoint. It should be noted that each interviewee might also see other high level significations which would have been revealed in other interviews. However, it is noteworthy that most interviewees were able to relate the interview to concerns which are well beyond the normal considerations of interface designers, such as politics, morality and personal happiness. Only one interviewee raised a concern that might normally be considered by the designer: to reduce costs.

In summary, the pilot study was similarly effective for all the industry sectors and interfaces considered. Apart from the web page, there appeared to be little difference in the responses given by the designers from those given by the users. It should be noted that all these systems were designed for a small number of users and it should be expected that the designers would be familiar with the users' concerns.

5.3 Implementation

5.3.1 Choice of subjects and interfaces

Although the pilot studies included both designers and users, no useful distinction was found between the two groups. It would also be very difficult to gain access to the designers of interfaces for generic applications, as these are rarely designed by a single individual. As the 'What for?' technique is intended to help the designer gain useful information about the user, the full experiment was limited to this scenario and no interviews were carried out with designers.

No analysis tool can guarantee to yield useful information for all possible analysts in all possible interface design conditions. To establish the potential of the 'What for?' technique, a single interview might be enough to show that the technique could yield useful results. In practice, designers are only likely to use a technique where they consider that the information obtained is worth the time expended. A more useful test would therefore need to establish a 'reasonable case', such as interviewing users from at least two different user groups using different types of interface.

As the experiment involves the assessment of an interface element, this must either be part of an existing interface being assessed, an existing interface due for re-design, or a potential interface being assessed in prototype form. As it is more difficult to gain access to prototypes, an existing interface was chosen for both sets of interviews.

In considering the number of interviews, it is also necessary to consider the conditions in which a designer might use the 'What for?' technique. It is not suggested that the interviews would provide all the information a designer needs but that they should be one of the tools available for user requirements gathering. In practice the designer of an in-house system is constrained in user requirements gathering by the number of people who will use the system. This also formed a constraint on the number of users interviewed in my research. In the pilot study, most of the bespoke systems were used by four or five users, some by only one user and one by 'about twenty'. Whilst generic applications might be used by a much larger number of users, initial studies by the designer are likely to be limited to a similar scale. The results from the pilot study also indicated that this scale of study could yield interesting results.

The pilot study indicated that one factor likely to affect the signification of the interface to the user was whether the interface was part of a bespoke system or part of a generic application. It was more difficult to obtain access to users of bespoke systems but personal contacts were used to gain access to a group of users within a major communications company using an international accounting system. Although a friend provided my introduction, neither the designers of the system nor the users were previously known to me. The second interface chosen was that of Microsoft Word, one of the most widely used generic applications. For ease of access, the second user group was composed of doctoral students taken from the Department of Information Systems and Computing at Brunel University. The researchers were working in a number of areas of computing, principally information modelling. Given that the experiment lies in the field of HCI, researchers in this field were excluded from the user group.

One element was chosen from each interface to form the basis of the investigation. In each case, a frequently used metaphor-based interface element was chosen, although it is likely that the frequent usage had led to the death of the metaphor for both groups. In the case of the accounting system, the chosen element was the 'Navigate' command on the tool bar at the top of all screens, used for changing to a new screen; in the case of Word, it was the 'Save as...' command on the pull-down 'File' menu.

The first interface examined formed part of an international accounting system. The total number of users at the main site was five, all working at the same group of desks in the same room. As the other users of the system were at remote locations, mainly in Australia and the Far East, it was not possible to obtain a larger sample than these five. The number of doctoral students interviewed was therefore also set at five to provide a balanced comparison.

5.3.2 Locations and times

The interviews with the users of the accounting system were arranged with the manager of the group to suit their availability - a factor over which I had no control. In the event, they took place in their normal workplace between two o'clock and four o'clock on a Monday afternoon. Their workplace is an open plan office which they share with five other teams, each of five to six people working in related business areas. To avoid disruption to the other workers or the chances of other interviewees over-hearing, the interviews took place in a small meeting room opening off the main office.

Interviews with the doctoral students using Microsoft Word were therefore arranged for the same time on the following Monday afternoon. Nine students were working in the same room, one of whom was known to me and therefore excluded. Of the others, five were immediately available for interview, and formed the user group for the study. The room is located in an attic area and it was possible to interview the researchers in a corner of the room without the other researchers being able to hear or see the activities.

Traditionally, methods such as the NCC (National Computing Centre) systems analysis and design methods stressed the importance of the analyst interviewing users in the users' workplace (NCC 1978, p.106-109). Newer methods claim to be heavily concerned with user understanding but this is expressed in terms of giving training or information to the user (Norman 1986b p.153-238) or of bringing users into the design team (Yeates 1991, p.18-28). This contrasts with the more traditional attitude in which the analyst would gather information from the users, going into the user's workplace to do so.

Whether this is deliberate or the importance of workplace interviews is simply taken for granted, the justification given by the NCC (1978, p.107) appears to remain valid - that interviewing in the workplace "can be an advantage, since the interviewee will feel more at home and additional information can be obtained from observation. Interruptions may tell a lot more than the interview itself." I have assumed that designers should continue to hold such interviews, and that the 'What for?' technique would form a part of them. I therefore carried out all interviews at the users' normal workplace. However, in both groups the interviewees worked very closely together, making it necessary to take each individual subject to a spare desk in one corner of the room or a side room during the interview itself to avoid others over-hearing the responses.

5.3.3 Interview practice

It was important that interviewees answered the questions freely without worrying about their remarks being taken as specifications of the software or complaints about it. It was also important that the interviewees were ignorant of the reasons for the questions (apart from their assistance in my PhD work), in order to avoid attempts at 'correct' answers. Finally, the pilot studies had shown that interviewees were sometimes bothered when they were unable to answer probes towards the end of the interview. I therefore read out the following paragraphs at the start of each interview (adapted slightly for the Microsoft Word users as only their personal anonymity needed to be assured):

I would be grateful for your help in some research I am carrying out for my PhD. I will ask some questions which I would like you to answer as simply and honestly as you can, where possible with a single sentence. Your answers will not be treated as a specification of the software and will only be used for research purposes. Your identity, the identity of the software and of this organisation will remain confidential. When the interview is complete, I will send you my record of the interview which you may correct if you wish to do so.

The questioning technique may seem a little unusual but I will be glad to explain its purpose once the interview is over. The technique is progressive and will probably lead to questions which you feel unable to answer. This is OK: please just say so and I will wrap up the interview.

Once this statement had been read and accepted, the next step of the experimental procedure was to point out the interface element forming the focus of the interview and ask 'what is this for?' Each response was then questioned in the same manner until the answers formed a closed loop or the interviewee felt that the question was unanswerable. In some cases it was necessary to repeat a question in a slightly different form when the user failed to answer. After the interviews were completed, a transcript was given to each subject to be checked for accuracy.

Asking personal questions risked alienating the interviewees and reducing their cooperation. I therefore assessed characteristics such as sex and age myself rather than asking about them. In the case of age, this consisted of placing people into one of the age groups: under 25, 25-35, 35-45, 45-55, over 55. These assessments, together with the other main characteristics of the two sets of interviewees, are summarised in the table below:

Table 5.5: Characteristics of user groups.

Characteristic | Group 1 | Group 2
Occupation | Clerical | PhD students
Sex | Four female, one male | All male
Ages | 25-55 | 25-35
Organisation | Large communications company | University
Location | Open plan office | Open plan office
Interface | Oracle-based accounting system | MS Word (wordprocessor)
Interface element | 'Navigate' command on tool bar | 'Save as...' pull-down menu command

5.3.4 Choice of personnel

As an experienced interface designer with training and experience in user requirements gathering, I considered it valid to carry out the interviews myself. Although I was biased in hoping the technique would uncover as much useful information as possible, any designer using the technique in the real world would also wish to uncover as much useful information as possible and would thus have a similar bias.

In practice, it is probable that the analysis of the interview content would also be carried out by the interviewer but, for this experiment, bias might be introduced if I carried out the content analysis myself. The analysis of the interview data was therefore carried out by independent evaluators. To check on consistency, two evaluators were chosen, one a fellow researcher in HCI, with experience of user interface design, the other a media studies graduate with previous experience of content analysis. Neither was informed of the aim of the experiment beyond what was necessary to train them in the content analysis method.

5.4 Analysis

The interviewees' responses were broken down into elements (see below). These elements were then given to the two independent evaluators for content analysis.

5.4.1 Content analysis methods

Budd et al (1967) and Krippendorf (1980) describe a number of approaches to content analysis, ranging from quantitative analysis of large quantities of data through to qualitative analysis of small amounts of data.

At one extreme is the tightly controlled quantitative approach, such as frequency comparisons of specific words or phrases. This generally requires very large quantities of data for analysis to give statistically significant results.

A second, more common approach is a looser version of this, relying on subjective gathering of words and phrases into categories. For example, the first method might count references to 'press freedom', and perhaps 'freedom of the press', whereas the second approach might also include references to 'journalists' rights', 'censorship' and 'protection of privacy.' Neither technique will normally consider whether the references are favourable or not. Such an analysis might, for example, measure how much interest the press of a country shows in a particular issue, but not its opinion on that issue.
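The difference between these first two approaches can be illustrated with a minimal sketch. The sample text is invented purely for illustration, and the function name `count_phrases` is hypothetical; only the phrases themselves come from the example above.

```python
from collections import Counter

# Hypothetical sample text, invented for illustration only.
TEXT = (
    "The editorial defended press freedom. Critics replied that freedom of "
    "the press has limits, and that censorship is sometimes justified. "
    "The union spoke up for journalists' rights and for press freedom."
)

# First approach: count occurrences of exact, pre-specified phrases.
EXACT_PHRASES = ["press freedom", "freedom of the press"]

# Second approach: the analyst subjectively gathers related phrases
# into a single category before counting.
CATEGORY = EXACT_PHRASES + [
    "journalists' rights",
    "censorship",
    "protection of privacy",
]


def count_phrases(text, phrases):
    """Return case-insensitive occurrence counts for each phrase."""
    lowered = text.lower()
    return Counter({phrase: lowered.count(phrase) for phrase in phrases})


exact_counts = count_phrases(TEXT, EXACT_PHRASES)
category_total = sum(count_phrases(TEXT, CATEGORY).values())

print(exact_counts)    # counts per exact phrase
print(category_total)  # total references to the broader category
```

Neither count says anything about whether a reference is favourable, which is the limitation noted above.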

The third form of content analysis is context sensitive. This relies on a further subjective assessment of whether a subject is mentioned in a favourable, unfavourable or neutral context. This is usually applied when an analysis is made of changes of attitude over time rather than providing an assessment of the actual balance of opinion at a point in time. It might, for example, show increasing support in the press for censorship or increasing support for press freedom. Commonly, it is based less on the statement of explicit opinions than on the form of language used, such as 'gagging order' (derogatory) versus 'privacy protection ruling' (complimentary).

Another form of content analysis specifically covers the analysis of ethnographic studies. As I was not carrying out this type of study, I did not consider it further. The final category is the one which is apparently the most appropriate for my work: content analysis of interviews or case studies. Unfortunately, this is the form covered least in the standard books and papers on content analysis.

The principal use of this type of content analysis is in the analysis of psychiatric case studies or similar types of interview. In this context, it is briefly touched on by Chirban (1996) and Gorden (1987). Neither of these authors gives any details on how to use the method which, it appears, is usually a matter of ad hoc design by the experimenter. I therefore returned to the standard content analysis texts of Budd et al (1967) and Krippendorf (1980), adapting their methods to fit the conditions of this study.

Analysis of the 'What for?' interviews is not content analysis in the conventional sense, in that there is no intention to seek pre-defined categories of signification. Content analysis usually depends on proving a particular theory through categories: "No content analysis is better than its categories, for a system or set of categories is, in essence, a conceptual scheme." (Budd 1967, p.39). However, the conceptual scheme in this case is much simpler - that a personal set of categories exists for the individual user and that these are hierarchical. The exact content of a category is not relevant to this, though it could form an interesting area for further research. The hierarchical nature of the categories is more difficult to prove, although it could be argued that it flows automatically from the recursive nature of the interview technique.

Krippendorf (1980, p.75-81) also places emphasis on the definition of categories, stipulating that categories must be defined by both definitions and examples if presented to an untrained observer. If applied to this work it would require the construction of 'extensional lists' (Krippendorf 1980, p.76-77) by which every expression within the text is given a tag to indicate its category. This is the approach taken, with numbered tags added to the response elements by the evaluators as the first stage of the content analysis process.

5.4.2 Structuring the responses

The chosen approach depends on splitting the responses into elements to be tagged. Each response was split into sentences and again into clauses. These were further broken down into sub-clauses wherever a preposition or conjunction that could introduce a new meaning had been used, such as 'to', 'for' or 'and'. In addition, where there was any possibility at all that a separate meaning might have been introduced, the clause was split.
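A rough mechanical approximation of this splitting is sketched below. It is not the procedure used in the study, which was carried out by hand and involved judgement about possible changes of meaning. The sketch assumes that sentence punctuation marks sentence boundaries, that commas and semicolons are an acceptable proxy for clause boundaries, and that the marker list contains only the three words named above; the function name `split_response` and the sample response are hypothetical.

```python
import re

# Marker words named in the text; any longer list would be an assumption.
MARKERS = ("to", "for", "and")

# Split at whitespace that is immediately followed by a whole marker word.
SUBCLAUSE_BOUNDARY = re.compile(r"\s+(?=(?:" + "|".join(MARKERS) + r")\b)")


def split_response(response):
    """Split a response into sentences, clauses and sub-clauses."""
    elements = []
    # Sentences: split after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        # Clauses: commas and semicolons used as a crude proxy (assumption).
        for clause in re.split(r"[,;]\s*", sentence):
            # Sub-clauses: split before each marker word.
            for part in SUBCLAUSE_BOUNDARY.split(clause):
                part = part.strip()
                if part:
                    elements.append(part)
    return elements


# Hypothetical response, invented for illustration.
sample = "It is used to bill one part of the company for work and services. I don't know."
for element in split_response(sample):
    print(element)
```

Like the manual breakdown, a rule of this kind errs towards excessive splitting, leaving it to the evaluator to group elements that carry a single meaning.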

The training for the evaluators, based on responses in the pilot study, included both examples in which elements had to be further split into multiple signification and examples where consecutive elements had to be gathered together into a single signification. However, the breakdown of the responses in the main experiment was deliberately biased towards excessive splitting of the responses. A new element signals the possibility of a new signification but the evaluator can always group elements together, whereas spotting multiple signification within a single element will depend on the evaluator detecting the change in meaning.

Consideration was given as to whether the elements should be presented to the evaluators in random order. However, the context of the elements was necessary to disambiguate them. Consider, for example, the following responses:

Table 5.6: An example of different categories allocated to the same phrase.

Response element | Tag
…to bill one part of COMPANY… | 8
…to another part of COMPANY. | 8
Because if one part of COMPANY is doing work… | 9
…or providing services… | 9
…to another part of COMPANY… | 10

In this example, the phrase 'to another part of COMPANY' was used twice but in different contexts: once to refer to billing and once to refer to the provision of services. It is necessary to present the elements in context to make this distinction clear. In context, the evaluator spotted that the second use of the phrase referred to a separate signification and tagged it with a different number.

Names and other details of the interviewees were removed from the response sheets, although names of the software and the organisations were left unchanged at this stage. Only the responses were included, not the 'What for?' and 'Why?' questions which had prompted them. The response elements were printed out in tabular form for each evaluator, with two additional columns, one for the tags and one for any comments.

5.4.3 The content analysis process

Each evaluator was given brief training in the content analysis process. The concept of signification was briefly summarised and the evaluators were asked to indicate "wherever a new meaning was introduced." This was done by tagging each response element with a number, starting at '1'. Where two elements carried the same meaning the evaluators were instructed to give them the same number. Where an element contained no signification, such as 'I don't know' or 'the second reason is...', it was to be tagged with a '0'. I worked through one example from the pilot study and each evaluator then practised with another pilot study example under my supervision.

The actual content analysis then consisted of two phases. The first phase was for the evaluators to tag the response elements. Each of them went separately through each set of responses, tagging them with numbers as in the training example. The evaluators were invited to use the comments column to raise any questions or uncertainties but none were entered at this stage.

The second phase of the process consisted of using the tagged responses to indicate common signification between interviewees. The first set of responses was used to indicate the initial set of categories to be used for the analysis. All '0' tagged elements were removed and the remaining answers sorted into numerical order. Although this lost some of the original contextual information, most of it remained. Where elements had been consecutively numbered, they remained in the order of the original responses, as they did when a group of elements was given the same number. The contextual ordering was disrupted when interviewees had returned to a previous meaning but, in these cases, ambiguity was reduced by the multiple entries for that tag. In practice, neither evaluator expressed any difficulty in identifying the categories.
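The effect of this sorting step can be shown with a small sketch. The response elements and tags below are invented for illustration and the function name `sort_by_tag` is hypothetical. The sketch relies on the fact that Python's `sorted()` is stable, so elements sharing a tag keep the order in which they occurred in the response.

```python
def sort_by_tag(tagged_elements):
    """Drop '0'-tagged elements, then sort the rest by tag.

    Because sorted() is stable, elements with the same tag stay in
    their original response order.
    """
    meaningful = [(text, tag) for text, tag in tagged_elements if tag != 0]
    return sorted(meaningful, key=lambda pair: pair[1])


# Hypothetical tagged response; the interviewee returns to meaning 1
# after introducing meaning 2.
tagged = [
    ("It moves you between screens", 1),
    ("I don't know", 0),
    ("so you can enter charges", 2),
    ("between the screens", 1),
    ("and print invoices", 3),
]

for text, tag in sort_by_tag(tagged):
    print(tag, text)
```

In the output, the return to meaning 1 is pulled out of its original position, but it now sits beside the earlier entry for the same tag, which is the sense in which the multiple entries reduce ambiguity.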

The second set of responses was treated in the same manner and the evaluator was then asked to compare it with the first set of responses and mark any duplicated signification across the two interviews. The results of this were then used to combine the two sets of responses into a single sheet of categories. The numbering of the first response set was maintained. Where no duplication of signification was marked, the elements of the second set were tagged '1.1', '2.1' and so forth. This allowed the two sets of responses to be merged in numerical order, maintaining the original flow of the responses. The third response set was then marked by the evaluator where it duplicated any meanings contained in the combined set. It was then merged into the combined set in the same manner and the process repeated for sets four and five. An identical process was then carried out for the second user group.
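One way of representing this merging is sketched below. It is an illustration under stated assumptions rather than a record of how the sheets were actually prepared. Tags are held as (major, minor) pairs so that '1.1' sorts after '1' and before '2'. The text only states the '1.1', '2.1' numbering for the second set; using a minor value of 2 for the third set, and so on, is an assumption. The names `merge_sets` and `label`, and the sample data, are hypothetical.

```python
def merge_sets(combined, new_set, duplicates, set_index):
    """Merge a newly tagged response set into the combined sheet.

    combined   -- list of (text, (major, minor)) already on the sheet
    new_set    -- list of (text, tag) with integer tags local to the new set
    duplicates -- maps a local tag to the sheet tag the evaluator marked
                  as carrying the same meaning
    set_index  -- 1 for the second response set (assumed 2 for the third, ...)
    """
    relabelled = [
        (text, duplicates.get(tag, (tag, set_index))) for text, tag in new_set
    ]
    # Stable sort: existing sheet entries precede new ones with the same tag.
    return sorted(combined + relabelled, key=lambda pair: pair[1])


def label(tag):
    """Render (1, 0) as '1' and (1, 1) as '1.1'."""
    major, minor = tag
    return str(major) if minor == 0 else f"{major}.{minor}"


# Hypothetical data: the evaluator marks local tag 2 of the second set
# as duplicating category 1 of the first set.
first = [("a", (1, 0)), ("b", (2, 0))]
second = [("c", 1), ("d", 2)]
sheet = merge_sets(first, second, {2: (1, 0)}, set_index=1)

for text, tag in sheet:
    print(label(tag), text)
```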

There was one distinction between the two evaluation processes at this stage. Evaluator One (the experienced evaluator) asked for the interview results to be sorted into tag order, whereas Evaluator Two (the interface designer) preferred them to remain in interview order to provide more contextual information. The total process took approximately one and a half hours for Evaluator One and about three hours for Evaluator Two. Only one comment was made at this stage, in regard to the accounting system, where the experienced evaluator marked two elements as equivalent if 'inputting charges' meant the same as 'invoicing'. From the introduction to the system provided by its designer I determined that this was the case and advised her accordingly.

5.4.4 Comparing the results from the two evaluators

The next stage was to check the two sets of analyses for consistency. Robson (1993, p.338-40) compares the appropriateness of various correlation tests. Pearson's correlation coefficient is based on an assumption of normal distribution which cannot be justified for this data. Other measures are the Spearman rank correlation coefficient and Kendall's rank correlation coefficient (Kendall's Tau). Robson (1993, p.340) states that "Kendall's Tau ... deals with ties more consistently", and I therefore took it to be the most suitable for this data, which contains tied ratings. My analysis followed the step-by-step procedure for calculating Kendall's Tau with ties within conditions given in Robson (1973, p.58-59). The results of the tests are summarised in the following table:

Table 5.7: Kendall's rank correlation coefficient.

User group | S | N | t0.05 | ta
1 | 1 | 28 | 0.25 | 0.88
1 | 2 | 22 | 0.29 | 0.66
1 | 3 | 19 | 0.33 | 0.48
1 | 4 | 37 | 0.25 | 0.78
1 | 5 | 31 | 0.25 | 0.87
2 | 1 | 37 | 0.25 | 0.59
2 | 2 | 65 | 0.25 | 0.44
2 | 3 | 58 | 0.25 | 0.64
2 | 4 | 19 | 0.33 | 0.92
2 | 5 | 33 | 0.25 | 0.64

Where:

S is the subject,

N is the number of pairs of ratings,

t0.05 is the smallest value of t significant at the 0.05 level for N, and

ta is the calculated value for the two analyses of the subject's responses.
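For readers unfamiliar with the statistic, the following is a minimal sketch of a tie-corrected Kendall's Tau (the variant commonly called tau-b). It is an assumption on my part that this corresponds to the step-by-step procedure cited above; the sketch is not a reproduction of that procedure and does not recompute the values in Table 5.7. The function name `kendall_tau_b` and the two short tag sequences are illustrative only.

```python
from collections import Counter
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's Tau with a correction for ties (tau-b)."""
    if len(x) != len(y):
        raise ValueError("both sequences must have the same length")
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
            # product == 0: the pair is tied in x, in y, or in both
    total_pairs = n * (n - 1) // 2
    tied_x = sum(c * (c - 1) // 2 for c in Counter(x).values())
    tied_y = sum(c * (c - 1) // 2 for c in Counter(y).values())
    denominator = sqrt((total_pairs - tied_x) * (total_pairs - tied_y))
    if denominator == 0:
        raise ValueError("tau is undefined when one sequence is constant")
    return (concordant - discordant) / denominator


# Two illustrative tag sequences for the same six response elements.
evaluator_one = [2, 3, 3, 3, 4, 4]
evaluator_two = [2, 3, 3, 4, 4, 4]
print(round(kendall_tau_b(evaluator_one, evaluator_two), 3))  # 0.818
```

In this example there are 9 concordant pairs, no discordant pairs and 4 tied pairs in each sequence out of 15, giving 9/11. The two evaluators disagree on only one element, yet Tau falls noticeably below 1, which shows how sensitive the measure is on short sequences.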

It can be seen that all values of Tau are well above those necessary to indicate significance at the 0.05 level. The two sets of categories can therefore be regarded as closely equivalent. With regard to the second stage, in which the evaluators looked at equivalences between interviews within a group, comparison is more difficult. Although the categories they were using correlated closely, they were not the same. Direct comparison of the two sets of equivalences is not possible without a common set of categories. It was therefore considered whether the two sets of categories could be merged into one.

Unfortunately, combining the two sets of results would depend on subjective judgement on my part or additional information from the evaluators. Consider, for example, the case where one evaluator tagged consecutive elements '2, 3, 3, 3, 4, 4' and the second tagged the same elements '2, 3, 3, 4, 4, 4'. It could be argued that each has identified the same three meanings in the text, merely disagreeing over the precise point in the sentence at which the signification changed: whether between the fourth and fifth elements or between the third and fourth, leading to a combined record of '2, 3, 3, 3/4, 4, 4'. An alternative explanation is that one evaluator spotted the introduction of a new signification in the fourth element while the other spotted a separate distinction in signification between the fourth and fifth, leading to a combined record of '2, 3, 3, 4, 5, 5'.

Without further information from the evaluators, distinguishing between these cases would depend on intuition or guesswork. The discussion of the results in the next chapter will therefore look at both sets of results, bearing in mind the different backgrounds of the two evaluators.
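The two readings of the example above can be made concrete with a short sketch. Both functions are hypothetical and deliberately simple: they assume that the two evaluators tagged the same elements, started from the same tag, and numbered new meanings consecutively without returning to earlier ones. Neither function can decide which reading is correct, which is precisely the difficulty.

```python
def merge_as_boundary_disagreement(a, b):
    """Reading 1: same meanings, disputed change point.

    Elements tagged differently are recorded with both tags.
    """
    return [str(p) if p == q else f"{p}/{q}" for p, q in zip(a, b)]


def merge_as_separate_distinctions(a, b):
    """Reading 2: every change marked by either evaluator is a new meaning."""
    merged = [a[0]]
    for i in range(1, len(a)):
        changed = a[i] != a[i - 1] or b[i] != b[i - 1]
        merged.append(merged[-1] + 1 if changed else merged[-1])
    return merged


evaluator_one = [2, 3, 3, 3, 4, 4]
evaluator_two = [2, 3, 3, 4, 4, 4]

print(merge_as_boundary_disagreement(evaluator_one, evaluator_two))
# ['2', '3', '3', '3/4', '4', '4']
print(merge_as_separate_distinctions(evaluator_one, evaluator_two))
# [2, 3, 3, 4, 5, 5]
```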
