Monday, 12 May 2014

A computer tool to aid the analysis of Chinese Buddhist texts

From: Michael Radich <Michael.Radich@vuw.ac.nz>
Subject: "tacl": a computer tool to aid the analysis of Chinese Buddhist texts

Dear colleagues,

I write to draw your attention to a piece of software developed by Jamie Norrish (who has done the programming) and me for the analysis of Chinese Buddhist texts.

The tool, which we call "tacl", is free software (also known as "open source", though we prefer the former term); anyone can therefore download or modify it as they wish. See:


The basic functionality of the tool, as described below, has a range of potential applications in the study of such questions as:

-- sources of a given text/corpus;
-- later impact of a given text/corpus (citations, borrowings);
-- stylistic features distinctive to a given author, text, corpus, milieu;
-- implications for dating texts;
-- the investigation and identification of texts of possible Chinese composition (including "apocrypha");

etc. 

At base, the tool is very simple in its conception. It operates on the XML files produced by CBETA. It analyses texts into "n-grams", i.e. strings of contiguous characters (here, individual Unicode Chinese characters) of user-defined length. It then allows the user to compare two or more texts or groups of texts ("A", "B", "C"...) to find either:

1) all (verbatim) strings SHARED by BOTH A and B (and C etc.);

or

2) all (verbatim) strings UNIQUE to A against B (and C etc.), or vice versa.
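By way of illustration only, the two modes of comparison amount to set intersection and set difference over the n-grams of each text. The following is a toy Python sketch of that idea (not tacl's actual code; the sample strings are invented for the example):

```python
def ngrams(text, n):
    """Return the set of all contiguous substrings of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# Two toy "texts", differing only in their final character:
a = ngrams("如是我聞一時佛在", 2)
b = ngrams("如是我聞一時佛住", 2)

shared = a & b        # mode 1: strings shared by both A and B
unique_to_a = a - b   # mode 2: strings unique to A against B
```

Here `shared` contains the six 2-grams common to both strings, while `unique_to_a` contains only the 2-gram found in the first string but not the second.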

For the purposes of such analysis, the user can define groups of texts of various sizes, ranging from a single text to the entire canon. It is also possible for users to edit the root library of texts manually, e.g. to split a single Taisho text into multiple parts.

Results are generated in the form of text (comma-separated values), which can then be further analysed, sorted or manipulated using such tools as Microsoft Excel. The tool incorporates a number of further functions which also allow the user to do such things as:

-- generate counts of particular n-grams in each text;
-- highlight matches in one or more texts inside a base text;
-- filter original raw results in various ways;
-- generate some summary statistics about a set of results;
-- search a corpus for a list of multiple n-grams at one time.
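The first of these functions, counting n-grams per text, can likewise be sketched in a few lines of toy Python (again, purely illustrative, not tacl's implementation; the sample string is invented):

```python
from collections import Counter

def ngram_counts(text, n):
    """Count occurrences of each n-gram of length n in a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = ngram_counts("佛說佛法佛說", 2)
# "佛說" occurs twice; "說佛", "佛法", and "法佛" each occur once.
```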

It is also possible to use the results of one round of tests as input to the next round, and thereby to concatenate multiple tests. This makes it possible to examine more complex questions, such as "What terms are found in Group X, and also in Group Y, but never in Group Z, and appear in Text A?"
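In set terms, the sample question above is simply a chain of intersections and differences, which is why results from one round can feed the next. A toy Python sketch (the group contents are invented placeholders, not real results):

```python
# Hypothetical sets of n-grams, as produced by earlier rounds of tests:
group_x = {"如是", "我聞", "一時"}
group_y = {"如是", "一時", "佛說"}
group_z = {"佛說"}
text_a  = {"如是", "世尊"}

# "Found in X, and also in Y, but never in Z, and appearing in Text A":
result = ((group_x & group_y) - group_z) & text_a
```

Each intermediate set here corresponds to one round of tests whose output is passed to the next.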

It must be borne in mind that in its present form, the tool only generates raw material for further human analysis (which, in my experience so far, can still be laborious and exacting); it is no magic bullet or crystal ball. Its results must be used with care and critical awareness, including thorough consideration of one's own operating hypotheses and underlying assumptions. The tool is also subject to various concrete limitations, such as the fact that it only finds exact verbatim matches, and the (related) fact that it cannot (at the current stage of development) handle variant readings as indicated in the Taisho critical apparatus. These limitations, too, must be carefully considered in analysing the results of its operation. Nonetheless, despite these limitations and caveats, I believe that it can already be a powerful aid to the study of a range of worthwhile problems.

Potential users should be aware that the tool currently operates from the command line, i.e. it has no GUI ("graphical user interface"; point-and-click). 

For just one example of work completed with the help of the tool, please see the following recent publication:

Radich, Michael. "On the Sources, Style and Authorship of Chapters of the Synoptic Suvarṇaprabhāsottama-sūtra T664 Ascribed to Paramārtha (Part 1)." Annual Report of The International Research Institute for Advanced Buddhology 17 (2014): 207-244.

In this article, I argue that four chapters of Baogui's 寶貴 synoptic Suvarṇaprabhāsottama-sūtra 合部金光明經 T664 ascribed to Paramārtha 真諦 (499-569) have a range of previously unobserved sources in earlier Chinese translation texts, and were probably composed in large part in China. I further argue for the likelihood that portions of these chapters were composed or revised in a context closer to the early Sui dynasty (581-618). In preparing this study, I used tacl to help uncover extended parallels between Paramārtha's chapters and Chinese source texts, and to gather stylistic evidence (terminology) more characteristic of Sui authors than of Paramārtha.

A follow-up to the above publication should appear next year, but that part of the study deals with Tibetan evidence, in a way that has little to do with the operation of "tacl". Another tacl-based publication, on a problem of a different type, should be forthcoming later this year.

We are still actively developing the tool in various directions. However, recent discussions with colleagues have convinced us that the time is probably right to bring it to the attention of fellow researchers. We hope that in so doing, we can make the power of the tool available to others to aid in new discoveries about our texts; gather suggestions from a wave of early adopters about possible further improvements; and, ideally, persuade others to join us in developing the tool further, including the underlying code.

Both Jamie Norrish and I will very happily entertain email correspondence, either off-list or on-list (the latter in my case only, as Jamie is not a member, and naturally only if the editors deem the query of general interest to list members), about all matters relating to both the technical nature and operation of the tool and its various applications to Buddhological research problems.

Should there be scholars who are interested in the tool, but put off by the technicalities of running it from the command line and other features of its current format, I will also be happy to entertain requests to run tests for the investigation of particular research problems, within the constraints of my available work time. (The computer component of the analysis usually happens quite fast. The biggest constraint on application of the tool so far has proven to be the time available to the person working with the tool, and the fact that to date, the only such person has been myself.) 

I would like to request that if you do download the tool, or try it out on research problems, you please let us know what you are doing and how you get on. It could be useful to us, in thinking about and planning future development of the tool, to know what others are doing with it, or even just how widely other scholars are interested.

Thank you,

Yours,

Michael Radich
Victoria University of Wellington, New Zealand