CDA and Corpus Linguistics: pros and cons of a methodological merging

This is a draft article that I wrote, from a CDA point of view, on the potential contributions of Corpus Linguistics (CL) to CDA and on the pros and cons of such a methodological merger.

Parts of this document have been incorporated in a forthcoming article: "A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press" by Paul Baker, Costas Gabrielatos, Majid KhosraviNik, Michal Krzyzanowski, Tony McEnery, and Ruth Wodak.

Incorporating corpus-based methodologies has been an emerging tendency in CDA studies in recent years. However, the number of CDA studies that incorporate machine-based techniques remains disproportionately low (Mautner 2005). Some researchers have tried to shed light on the merits of incorporating machine-based methodology, specifically corpus linguistics, in critical discourse analytical studies (Hardt-Mautner 1995, Mautner 2005, Koller and Mautner 2004, Stubbs 1996, 1997), while traditions in CDA seem to have reservations about taking on board a Corpus Linguistics (henceforth CL) approach for various theoretical, logistical and methodological reasons (Mautner 2005, Caldas-Coulthard 1993).

The interest in a merger between CL and CDA can be traced to two crucial and yet quite distinct factors. Firstly, the advent of new technologies and the increasing availability of electronic sources all around the world (e.g. the COBUILD project at Birmingham University, the British National Corpus and the UCREL project at Lancaster University), along with emerging content on the internet and even the use of search engines as corpora, have nurtured a new domain of analysis which can be considered both necessary and advantageous. That is to say, the changes in the public sphere and the penetration of the internet into various aspects of everyday life increasingly urge, and at the same time lure, CDA studies to pay more attention to this domain and its affordances. The shift of the public sphere from more traditional domains to online virtual spaces over the past two decades is an automatic call for problem-oriented, discourse-focused studies to re-orient themselves, as ‘in a variety of domains -from the intensely personal and local to the public and global- discourse on the web is now a key factor in constructing representations of reality and intertextuality’ (Mautner 2005: 821). However, this re-orientation in CDA is not a challenge-free endeavour, given the difficulties that the new genre poses, e.g. ‘dealing with huge sized material’, ‘seamless and elusive quality of the content’, and ‘obscurity of authorship’ (Mautner 2005: 815).

Nevertheless, these new ways of accessing linguistic data bring with them newer approaches that take account of the genre-specific qualities of these sources, and hence more machine-based tools and techniques are inevitably brought into CDA’s more traditional manual analytical approaches. Moreover, relevant developments in technology, e.g. the availability of very large electronic archives of newspaper articles, have given new salience to the (re)emergence of corpus linguistics as a useful and legitimate analytical technique in linguistics.

The other line of interest in incorporating CL into CDA is a consequence of academic debates between CDA scholars and their critics (Stubbs 1997, Hammersley 1995, 1997). Criticism of CDA ranges from a strictly ‘applied linguistics’ perspective (Widdowson 1995, 1998, 2004), filled with quasi-positivist conceptions of ‘scientism’ that are unwilling to recognise how CDA’s epistemological and philosophical standpoints differ from those of more methodology-oriented lines of linguistics, to attempts by people within CDA to contribute to its methodological rigour regarding, debatably, data selection techniques, systematic data sampling and analytical categories. Amidst these criticisms, CDA’s notion of social commitment and its potential emancipatory activism prevent the approach from compromising its ‘sensitivity’ at all levels of study, and at times it finds those approaches inherently ‘rudimentary’ and shallow for a critical account of the problem under investigation. Similarly, Mautner (1995) points to the inherently holistic approach of CDA and its concern with accounting for the discourse/society interface, and asserts that ‘its [CDA’s] traditions...does not augur well for integration of computer-aided analysis’ (1995: 3) and, emphasising ‘sensitivity’ in CDA studies, cites the argument that ‘critical interpretation requires historical knowledge and sensitivity, which can be possessed by human beings but not by machines’ (Fowler 1991: 68, cited in Mautner 1995: 3).

CDA endorses ‘criticality’ and ‘context dependency’ (be it co-textual, discoursal or socio-political) as integral aspects of research. That is, CDA is not interested in de-contextualised data per se; it goes beyond the data/text and examines the processes of production and interpretation and the socio-political contextual elements of a text. Furthermore, CDA is socially committed: it is heavily informed by social theory and looks at discourse/linguistic data as a salient type of social practice, and as both the container and the mirror of ideologies at work in society.

Furthermore, the ‘explanatory’ level of analysis and discussion is a core aspect of CDA: the data analysis is contextualised within the specific socio-political context of the society, and the same explanatory level is consulted in the data selection procedure. A CDA study is required to account for contextual characteristics in its selection, analysis and conclusions. That is, CDA-type data selection needs to be ‘sensitive’ to the goals of the research and to the socio-political context.

On the other hand, a ‘critical’ analysis is interested not only in what linguistic elements and processes exist in a text or set of texts; it also needs to account for why, under what circumstances and with what consequences the producers of the text have made specific linguistic choices among the several options that a given language may provide. That is, a critical analysis is interested in what is ‘present’ and what is ‘absent’ in the data. This forms a methodological strand of CDA criticism against descriptive, data-driven approaches: they are epistemologically inadequate to account for the linguistic choices made in the process of producing a text, and thus miss a valuable amount of in-depth insight. Mautner (1995) also warns, though sympathetically, against the possibility of machine-based text analysis turning into such essentialism and simplification, and argues that ‘there is a danger, as Stubbs and Gerbig (1993:78) also remind us, of ‘counting what is easy to count’. That is true, in particular of many syntactic phenomena and of discoursal patterning’ (Mautner 1995: 23).

However, these debates mean neither that incorporating machine-based methodologies into CDA approaches to data selection and analysis is unfeasible, nor that CDA would not or could not welcome and benefit from quantitative approaches like corpus linguistics. CL can contribute valuably to refining CDA’s procedures in data selection, specifically when dealing with large data (Mautner 1995: 1), and in analysis, provided that the selection is carried out sensitively and does not set itself up as an antithesis to CDA’s aims.
A combination of these factors (the specific possibilities offered by electronic archives, new technological advances, the emergence of online space as a public sphere, attempts to systematise data selection in CDA, and the new analytical tools offered by CL) is among the motivations for the present project, which is a combined and yet independent corpus linguistic and CDA study of discourses of immigrants, asylum seekers and refugees in British newspapers from 1996 to 2006.

CDA and CL perspectives

The present study has been designed as two strands, CL and CDA, working independently towards accounting for representations of refugees, asylum seekers and immigrants (RASIM) in British newspapers, in order to see how these two strands can integrate methodologies and insights and benefit from one another.

From a strictly CDA perspective, not only are quantitative approaches unsuitable tools for accounting for discursive strategies in newspaper discourses, but a descriptive, data-driven approach like CL is also, per se, inherently ill-equipped to target social problems and to relate the linguistic analysis to the social context of language in use. Thus, there is a pressing need for an explanatory level, informed by social theory, to be added to the analytical framework. The CDA approach to data and data selection is heavily informed and shaped by theoretical concepts, with an extensive literature available on social theory, and by a commitment to tackling a social problem, e.g. racism or gender inequality. Thus, unlike the emphasis given to “systematic” data collection and randomisation in applied linguistics studies, ‘systematicity’ in data selection is not the most crucial defining factor in the design of a CDA study, although it may be desirable. CDA studies may draw on a range of different texts and materials which are not compatible in form, content or genre, and different parts of the analysis may draw on different types of text. Quantitative approaches, on the other hand, foreground the data as both the focal point of the research and its end product. That is, the analysis starts and finishes by ‘describing’ different aspects of the data.

On the other hand, it can be argued that CDA is flexible in moving between theory and methodology, with each influencing the other: e.g. data analysis may or may not support an existing theory, or the theory may call for a new approach to the type of data and methodology. Comparing the CDA and CL analytical approaches to discourse analysis, one may argue that CDA starts with theory, moves to data analysis and ends by establishing the link between the linguistic findings and the social context; CDA has the apparatus to “make sense” of “why” linguistic findings are the way they are. A CL analysis, by contrast, seems to start from a theoretical vacuum and finish with a descriptive analysis of the language, with some potential conclusions drawn in numerical terms.

In short, it can be argued that CDA is theory-oriented and looks in depth at rather small sets of texts, while CL is methodology-oriented and looks at much larger-scale data, though only at basic linguistic features. Thus, the differences include: (a) the depth of the analysis, which is the classic difference between qualitative and quantitative approaches, and (b) having a “critical” or a descriptive approach. At the same time there is a major commonality between the two and that is the fact that both CDA and CL work with data.

CL: Data → Results → (possibly) Theory

CDA: Theory → Data → Results → Theory

It is the link between the initial theory and the data selection procedure in CDA which has been criticised as subjective and unsystematic, and this is where a corpus-based approach can help, especially with large data sets such as that of our project.

This project

In the present project we needed to deal with a colossal archive of newspaper articles, covering all the articles in all British newspapers on or about refugees, asylum seekers and immigrants (RASIM) within a time span of 10 years. A collection of this size could not have been managed without some preliminary CL analysis to help the CDA strand find a rationale for choosing its necessarily very limited number of articles.

Within the goals of the present study, the CDA part of the project based its first step of data selection and sampling on a preliminary corpus analysis of the whole data set, in which CL was able to spot spikes in the frequency of occurrence of RASIM across the data. This gave CDA a starting point for data sampling and for keeping it as systematic as possible.

Complementary CL/CDA

One major contribution of CL to CDA when dealing with large-scale data lies in data selection, where a CDA analyst may be criticised for arbitrariness in selecting the texts to be analysed. CL can create a systematic selection procedure and provide a macro map of the available data, showing what is happening in which part of the data and when, so that periods/events can be spotted for CDA to look into and select texts accordingly. In our project, for instance, the CDA analyses focused on five spikes found by CL, where the frequency of occurrences of RASIM was significantly higher than at other times/events, and this established the first step of a systematic data selection for the CDA strand of the project.
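
As a rough illustration of how such spikes might be located, the following Python sketch counts articles per month and flags the months whose count lies well above the average. It is only a sketch under stated assumptions: the function name `find_spikes`, the cut-off rule (mean plus 1.5 standard deviations) and the toy data are all hypothetical, and this is not the procedure actually used in the project.

```python
from collections import Counter
from statistics import mean, pstdev


def find_spikes(article_months, threshold=1.5):
    """Return the months whose article count exceeds mean + threshold * std.

    `article_months` holds one "YYYY-MM" label per article that mentions
    the query terms. The cut-off rule is an illustrative assumption.
    Months with no articles at all are not represented in the counts.
    """
    counts = Counter(article_months)
    values = list(counts.values())
    cutoff = mean(values) + threshold * pstdev(values)
    return sorted(month for month, n in counts.items() if n > cutoff)


# Hypothetical toy data: one label per article, with a burst in April.
article_months = (
    ["2001-01"] * 4 + ["2001-02"] * 5 + ["2001-03"] * 3
    + ["2001-04"] * 30 + ["2001-05"] * 6 + ["2001-06"] * 4
)

spikes = find_spikes(article_months)  # the April burst is the only month flagged
```

The months returned by such a procedure would then serve as candidate periods from which the CDA analyst selects texts for close reading.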

Thus, the general qualities of the data made available by CL techniques can help a CDA analysis strengthen its rationale for focusing on a specific time or source, based on a descriptive mini-model of the whole data, rather than sampling the data in an ad hoc manner or resorting to the more classic randomisation (e.g. selecting every tenth article), which would lack the required sensitivity and would treat all the data as equally relevant and significant.

Moving on to the more analytical categories, collocation analysis can show how cognitive associations are created and confirmed through the concrete linguistic alignment of a social actor, e.g. immigrants, with its macro-structural associations throughout the huge body of data available, arriving at different descriptions for different times and events. CDA can deconstruct and investigate how such social cognition is constructed through the “softer” mechanisms of argumentation and the semantic alignment of propositions and topics. It engages extensively with argumentation and tries to establish the role of the context and co-text of texts, though in limited numbers, in creating, confirming or perpetuating certain cognitions. CL, on the other hand, throws light on the concrete linguistic realisations which may construct certain cognitions, by examining a huge body of data, arriving at meaningful collocations of certain lexical elements and relating them to a certain macro-structure. That is to say, CDA makes an in-depth diachronic (contextual) and synchronic (co-textual) investigation of a limited number of texts, while CL carries out a descriptive investigation of the qualities of texts on a scale which is not feasible for a CDA analysis. (CDA may not actually need to go beyond a certain number of detailed analyses, since the categories and findings seem to become highly repetitive; the CL and CDA results of this project also largely overlap.)
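
To make the idea of collocation a little more concrete, here is a minimal Python sketch that counts which words occur within a few tokens of a node word such as "immigrants". The function name `collocates`, the window size and the three example sentences are hypothetical, and the sketch uses raw co-occurrence counts only; corpus tools normally rank collocates with statistical association measures, which are not shown here.

```python
import re
from collections import Counter


def collocates(texts, node, window=3):
    """Count the words found within `window` tokens of each `node` occurrence."""
    counts = Counter()
    for text in texts:
        # Very simple tokenisation: lower-case alphabetic strings only.
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, token in enumerate(tokens):
            if token != node:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts


# Hypothetical example sentences, invented for illustration only.
texts = [
    "A flood of immigrants arrived at the port.",
    "Officials said illegal immigrants were detained.",
    "The town welcomed immigrants with a flood of donations.",
]

counts = collocates(texts, "immigrants", window=3)
```

In this toy data "flood" falls within the window of "immigrants" in two of the three sentences; whether such a pairing carries the same evaluative meaning in each case is exactly the kind of question that is left to the CDA reading.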

One of the shortcomings of relying on CL techniques lies in the range of textual qualities that a CL analysis can address. The CL approach is mostly “lexical” and is restricted to analysing the arrangement and distribution of ‘words’ in the data. Thus, it proves most productive when accounting for what CDA calls the ‘referential’ strategy, which mostly targets how different social actors are named or referred to. The other discursive strategies, e.g. the predicational (to a more limited extent) and the argumentative, are largely left out, as they mostly function on linguistic units larger than ‘words’. The referential qualities of discourse (how social actors are referred to) are only one of the levels of analysis in CDA, and may not be its strong point, as reference may not be a very opaque technique (although it can be argued that only a CL-type analysis, which looks at referential qualities throughout all the data, can establish that, for instance, ‘RASIM are always referred to with negative adjectives’). On the predicational side, CL can also throw some light on what actions and verbs usually populate discourses on/about certain social groups, but because a CL analysis cannot, or is not meant to, look closely at the immediate co-text and context, it may result only in a general descriptive analysis of the qualities of verbs and attributions. CL can partly account for co-text by analysing what comes before and after a given occurrence of RASIM. Hence, a comprehensive analysis of these strategies throughout the data will need to incorporate some CL-based techniques, while also going beyond them.
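
The partial access to co-text mentioned above is usually obtained through concordance lines, i.e. each occurrence of a query word shown with a few words on either side. The sketch below is a minimal, hypothetical illustration in Python; the function name `kwic`, the context width and the example sentence are assumptions made for the purpose of illustration.

```python
def kwic(text, node, width=2):
    """Return (left context, keyword, right context) for each hit of `node`.

    Tokens are split on whitespace; punctuation is ignored only when
    matching the keyword, so it remains visible in the context.
    """
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower().strip(".,;:!?\"'") == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Hypothetical example sentence, invented for illustration only.
sentence = (
    "Ministers said that asylum seekers were housed in hotels "
    "while other asylum seekers waited."
)

lines = kwic(sentence, "seekers", width=2)
```

Each line shows only a narrow window around the query word, which is why such output can inform, but not replace, the reading of whole texts in their context.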

As mentioned above, however, CDA’s strong point is not locating and analysing referential strategies per se. It builds on a network of referential, predicational and argumentative strategies, along with the study of metaphors, presuppositions, mitigation, hyperbole etc., and most importantly on an amalgamation of these and their interfaces in deconstructing a text. CDA can account for the consequential effects of certain elements of co-text and context in the process of production and interpretation of a text. Going beyond the referential and predicational aspects (which may be considered word- or sentence-level analyses), CDA can also account for the argumentative aspect of a text and try to capture the process of a consumer’s cognitive interpretation of it, while argumentation analysis is by definition outside the realm of a CL analysis.

The ‘scale’ of analysis is another important factor in a CL/CDA merger. While CL can capture qualities of discourse on an immense scale and come up with an index of descriptive characteristics of these texts in terms of the forms of language being used, CDA can account for ‘meanings’ and meaning construction mechanisms in a minimal number of texts, and for the more abstract aspects of discourse. On the other hand, CDA may not be entitled to generalise from its findings on a limited number of texts, while CL may be useful for examining whether such qualities are systematically relevant in larger data and whether such generalisations can be made. Thus, the two strands can be seen as a form of triangulation in examining the data.

It is also apparent that such a merger between CDA and CL is most useful if the CDA study is followed by a CL analysis, and if the CL analysis is seen as part of the methodological apparatus of CDA. In this way, CL can compensate for its apparent lack of conceptualisation and of orientation as to what kind of analysis needs to be carried out and what the results may ‘mean’. Thus, CL is best carried out while the CDA analysis is going on or immediately afterwards, so that it can be used to pursue the questions raised in the mind of the CDA analyst, e.g. to examine whether a certain quality is a prominent characteristic throughout the data and whether it can be generalised.

Contrary to what may be believed, a CL analysis, like a CDA one, requires a certain level of “subjectivity” in deciding on the categories to be analysed: e.g. should we look at the data in terms of a conservative/liberal dichotomy or a broadsheet/tabloid one, and what query words and collocations may be relevant? Thus CDA can bring the necessary ‘insight’ to CL to create a focal point and a ‘reason’ for investigation, that is, if CL is considered to be a methodological tool for CDA.

As language is a highly dynamic and creative phenomenon, the use of language in propagating a certain ideology may change within a short period of time. By the same token, depending on the type of discourse, e.g. the type of newspaper under study here, different degrees of ‘linguistic’ realisation may be spotted. While some newspapers may be ‘creative’ in producing an ideology, others may copy more ‘classic’ linguistic forms, e.g. the famous metaphors of large quantities for immigrants, in propagating that ideology. For example, one of the CDA findings on discourses about RASIM in our project is that broadsheet newspapers avoid the classic metaphors and terms against RASIM while tabloids use them freely. This is also confirmed by the CL findings. Thus, when a CL approach looks for occurrences of the more ‘classic’ and ‘known’ linguistic terms and metaphors which are believed to be used about RASIM, it is inherently oblivious to the more creative mechanisms in the quality press, where the same or new (negative) categories are perpetuated without making use of the ‘known’ words and collocations. Hence the results of CL on the representation of RASIM in newspapers are actually reduced to checking how many everyday stereotypes are recycled and reproduced in the newspapers, while ignoring the important fact that discourses are always both productive and reproductive. CDA, on the other hand, is vigilant about both aspects, namely discourses that draw on general stereotypes (reproductive) and discourses that create and feed into those stereotypes (productive). Thus, CDA may find that conservative broadsheets also perpetuate a (negative) ideology against RASIM, while CL may not be able to locate and confirm this because its starting point of analysis is the existing notions.


Hey this is very interesting and indeed useful! It would definitely be interesting to see a step-by-step methodology of the specific project, demonstrating where and how CL comes in the CDA analysis and vice versa - as you say they can complement each other in various stages!

Second, about my favourite one -cognition! A scary thing, if you ask me! But since you are mentioning it, you could say a bit more about cognitive patterns when you say collocations are related - apparently they create connections in the mind between the collocates, and/or further reflect mental patterns by putting them in use. Any reference or elaboration in the places where you mention cognition would be good, because in this brief draft the connection with cognition (and consequently with society) is not very clear.

Finally, another nice side-effect of the 'brief draft': The phrase 'there is a major commonality between the two and that is the fact that both CDA and CL work with data.' :D It really doesn't seem to be that much of a similarity!!! :P The point would be clearer if you wrote a bit here juxtaposing CDA and CL with 'armchair linguistics' - indeed there are approaches which do NOT deal with data (inconceivable as it may be for many of us!). So say a bit more about CDA and CL being empirical.

Hope this helps, good luck with the rest of your work and keep writing interesting stuff!!!

A pestered colleague

This is very helpful since it appears to clarify the way in which one can combine both methodologies in their study. I sure hope this can convince my professor to allow me to use both in my study.

It was a really useful piece of article and it helped me sort out a way into a research project.

Thank you so much for the article in an open blog.


