^{1}

^{*}

^{1}

^{1}

This paper is intended to present the implementation and testing methodology of Transductive Support Vector Machines (TSVM) proposed by Joachims et al., and extended by Li et al. Initially, Support Vector Machines are explained as optimal classifiers, along with the concept of transductive inference. Along the implementation process, several tests were performed. The data used for such tests was very diverse especially with respect to the dimensionality (number of samples, features, etc.). The ultimate objective was the evaluation of the transductive inference tool in the already developed Intelligent Interface Web Engine from the SISTA group at the Catholic University of Leuven (Belgium).

Este artículo muestra la implementación y metodología de pruebas de Máquinas Transductivas de Soporte Vectorial, propuesta por Joachims et al., y extendida por Li et al. Se explican las Máquinas de Soporte Vectorial como clasificadores óptimos, junto con el concepto de inferencia transductiva. Durante el proceso de implementación, varias pruebas se realizaron. Los datos para estas pruebas fueron diversos, especialmente respecto a la dimensionalidad (número de muestras, características, etc.). El objetivo final fue la evaluación de la herramienta de inferencia transductiva en el Intelligent Interface Web Engine desarrollado por el grupo SISTA de la Universidad Católica de Leuven (Bélgica).

The current society of information is facing a new challenge: Thousands of Megabytes of information are available not only in the public Internet but also in private networks. This information is used continuously, but tools for accessing and manipulation of data do not fulfill the expectations of the users, hence efficient tools are becoming a necessity.

Text classification becomes a key tool in order to deal with such amounts of information. Text classification can be used to organize document databases, filter Spam from e-mail accounts, or even to learn user’s news reading preferences. Search engines for On-line information are a kind of text classifier but they often retrieve results far from perfect.

Most of the results obtained are irrelevant, and in many cases the number of results and their ranking is far from the desired criteria. Techniques derived from Artificial Intelligence and Machine-learning theory have contributed to the improvement of such search engines, leading to better, faster and more accurate results. Kernel methods have become one reliable and robust technique well suited to deal with high dimensional problems, ideal for facing text-mining tasks. More precisely, the Support Vector Machines have been tested in such tasks leading to interesting results. These results have been obtained based on the paradigm of Inductive Inference. This paper contains the results of an application of Support Vector Machines to text classification but now based on the paradigm of Transductive Inference.

The following article is composed of the following sections: Inside the section of methodology we are to going to describe the theoretical background that is necessary to understand the research. The section 3 express experiments and results and discussion of them. Finally in section 4 we describe the different conclusions obtained from the research process.

Text classification in the context of machine learning is a supervised learning approach oriented to create classifiers that automatically classify documents into a fixed number of semantic categories. Since each document can exist in either one, multiple or no category, to facilitate the task, each category is treated as a separate binary classification problem. Every document is a collection of string of characters. Such collection has to be transformed into a representation suitable for learning algorithms and classification. The Stemming tool helps to obtain such representation. The stem of a word is the canonized element of the word. That is the removal of word suffixes such as plurals, tenses, and deflections. In a given engine this stem representation is robust to spelling errors, but have the risk of returning irrelevant items. In order to obtain an attribute-value representation of a text, each word is counted within the text and its attribute is related as a feature in the vector representation.

This attribute is known as time-frequency TF (_{
wi,x
} ). The word _{
wi
} exists _{
vi,j
} is the weight of index term

Where _{
nj
} is the number of documents that contain the index term in their description. For the SISTA database, the model adopted was the tf-idf model ^{[}^{1}^{]}.

The measure used in this project to evaluate the performance for text classification is Precision/Recall ^{[}^{2}^{] [}^{3}^{]}. Precision is the probability that a document predicted to be in class “+” truly belongs to this class. Recall is the probability that a document belonging to class “+” is classified into this class.

Support vector machines were developed by Vladimir Vapnik, based on the principle of Structural Risk Minimization: To find a hypothesis h from the hypothesis space H for which one can guarantee the lowest probability of error, for a given training example. The aim is to find the classifier or discriminator function that maximizes the distance within classes, assuring the lowest probability of error. This is used to classify a set of l training examples

The classification function is given by the expression:

If the data is linearly separable this function will separate the data in a perfect way, and the function (2) will show the following property.

Multiple hyperplanes fulfil this requirement however the aim of the support vectors machines is to construct an optimal hyperplane that maximizes the geometric distance in the close data points, which is called the margin. (¡Error! No se encuentra el origen de la referencia.)

If we want to maximize the decision margin we should minimize the quadratic norm of

Subject to:

Introducing the Lagrange formulation:

Hence, writing our optimization problem in terms of Lagrange we get:

The conditions for the minimum of J in the variables

Replacing and in yields,

From (6) we obtain

Replacing in (8) gives:

This is equivalent to

The main advantage of the Support Vector Machines is that we can solve this problem as a QP problem, the optimization problem could be written in the following form (Here in matrix notation):

Taking into account that in our case x=( our optimization problem in terms of QP is as follows:

The restrictions are as follows,

And G=[0] d=[0].

The method used to train Support Vector Machines belongs to the class of inductive inference methods. In this class of methods, particular examples are used to infer the general concept. The learner induces a decision function with a low error value on the whole distribution of the examples used during the learning phase. But in most of the situations, we don’t care about the particular function; we only need to classify a given set of examples (i.e. test set) with as few errors as possible. The transductive inference [

Inductive inference generally claims for an enough amount of training data in order to determine the decision function. But for many practical uses of text classification, it is crucial to the learner to be able to generalize using little training data. Transductive inference tackles the problem of learning from small training samples [

Some of the applications for the transductive text-classification are the following: Relevance Feedback: Used in most the customizable interfaces of the current RSS sources. After an initial query made by the user, he marks the retrieved documents as relevant or irrelevant. This composes the training set of the application. The remaining documents in the database are the test set. From this information, the learner has to classify the test set as relevant or irrelevant documents to the query. Netnews Filtering: Since there are thousands of topics available in the everyday net news. The user is interested obtaining the relevant information for his profile according to few labelled news selected during previous visits to the newsgroup. Reorganizing a document collection: Nowadays organizations are using databases of documents with classification schemas.

When one new category is created a text classifier must perform a full classification of the entire collection of documents given very few training examples. Observe that all applications have in common little training data but large test sets. More recently they has been implemented in problems with unbalanced data sets [

Section 2 showed that SVMs are well suited for a learning task of the form

The learner L is given a hypothesis space H of functions _{
Strain
} of

Where each training example consists of a document vector and a binary label. In the transductive setting, the learner is also given an i.i.d. sample _{
Stest
} of

The goal of the transductive support vector machine (TSVM) or transductive learner is to select a function _{
Strain
} ,_{
Stest
} ) from the hypothesis space _{
Strain
} and _{
Stest
} such that the expected number of erroneous predictions on the test and the training samples is minimized. But, why not use the learning rule obtained by means of the training examples to classify the test examples? The problem arises when a small training set is used. The classifier will have a “poor” generalization due to the lack of knowledge about the distribution of points in the space

Unlike of the inductive setting, the transductive setting uses the location of the test examples when defining the structure. Such structure corresponds to a structure of possible hypothesis solution. Using prior knowledge about the nature of P(x,y) provides extra information to build an appropriate structure and learn more quickly. The structure is build based on the margin on both the training and the test data. For linearly separable problems, the transductive learning can be achieved by solving the following optimization problem.

OP 1 (Transductive SVM (lin. sep. case)

Minimize over

The solution of this optimization problem is not only the separating hyperplane

Most text-classification problems are linearly separable ^{[}^{2}^{]}, but sometimes the data is non-separable by a linear function, then we can introduce slack variables ^{[}^{2}^{]}.

Where

When we train a transductive SVM we are solving the (partially) combinatorial optimization problem OP2. For a small number of test examples, it could be done manually, trying all possible assignments ^{[}^{2}^{]} is designed to handle large test sets common in text classification with more than 10.000 test examples. It found an approximate solution to the optimization problem OP2 using a form of local search.c.

The key idea of the algorithm is that it begins with a labeling of the test examples done by means of inductive SVM. Then it improves the solution by switching the labels of the test examples such that the cost function decreases. It takes the training and the test examples as input and output the labeling for the test example and a model created by means of the inductive inference.

Besides the two parameters

The following are special properties that every text-classification task have, such as High dimensional input space, Document Vectors are sparse and few irrelevant features ^{[}^{2}^{]}. Joachims argues that TSVMs are well suited and even outperform many of the traditional approaches for text-classification. Intuitively it can be explained by the fact that the transductive learning inherits most of the desirable properties of the inductive learning. TSVMs take advantage of the fact: that words occur in natural language in strong co-occurrence patterns ^{[}^{2}^{]}. Some words are more likely to occur together in a document than others. For example, when we ask a search engine like Google® for the words

Thorsten Joachims has developed SVMlight ^{[}^{6}^{]} that is an implementation of TSVM for the problem of pattern recognition, for the problem of regression, and for the problem of learning a ranking function. We evaluated the performance of this application against our version made in MatlabTM ^{[}^{7}^{]}. The MatlabTM version overcomes SVMlight in an example where from a gaussian distribution we drew 100 data points. All the data points were previously labelled into a positive and negative class (50+/50-). The data distribution is linearly separable and the learning machine was trained with 15-labeled examples as training set and 85 as test set. Joachim’s application is oriented to work with large data set with many features rather than with a two-dimensional feature vector. While the MatlabTM version reaches the perfect classification of the test examples with Precision/Recall = 100%/100% (

The advantage of SVMlight is its performance speed on tasks of high dimensional vectors as text classification. This characteristic will be described further.

Given a set of 610 examples that correspond to Reuters’s articles, we have to found those that correspond to “Corporate Acquisitions”. With only 10 examples as training examples labelled 5 as positives and 5 as negatives, the TSVM must classify the remaining 600 test examples. The SVMlight algorithm is trained with this set of examples. Parallel to this we must tune certain parameters. The work with SVM demands to tune certain parameters, then some questions arise: Which kernel we can choose? What criteria for the generalization parameters ^{[}^{2}^{]}. So for this task, we work with a linear kernel. The best value of ^{[}^{2}^{]}.

We let SVMlight calculate ^{[}^{2}^{]}. Finally, SVMlight employs only 40.42 seconds in the evaluation and creation of the model (SVM) and 1 second performing the predictions for the 600 examples. The results show 13 False Positives and 11 False Negatives. This gives us a measurement of Precision/Recall = 95.6667%/96.33%, giving 96% of accuracy on the test set (24/600). In order to evaluate the performance of the MatlabTM version of the TSVM algorithm, we test it with the Reuters task. The algorithm took 26.06 minutes computing and generating predictions for the test data. The results show 19 False Positives and 4 False Negatives. This gives us a measurement of Precision/Recall = 93.667%/98.6667%, giving 96.1667% of accuracy on the test set (23/600). This slightly improvement does not mean any significant advance in terms of results obtained. What is notoriously highlighted is the speed performance of the algorithms: 40.42 Seconds of the SVMlight against 26.06 minutes of the MatlabTM implementation. From this, it is clear that the SVMlight implementation is better suited for text-classification tasks. To automatically define the parameter ^{-1}.^{[}^{2}^{][}^{6}^{]}. With the data of the Reuters task, this function gives a value for

C
Precision
Recall
False P+
False N-
0,05
47,3333
99,6667
158
1
0,10
59,3333
99,6667
122
1
1,00
93,6667
98,6667
19
4
5,00
96,0000
96,6667
12
10
10,00
96,0000
96,6667
12
10
1000,00
0,0000
100,0000
300
0

Note that the best results are reached when the parameter ^{[}^{2}^{]}.

Our MatlabTM implementation of the TSVM algorithm works adequately. The only drawback is response times. Computationally speaking the bottleneck of the TSVM algorithm is the optimization task, which has to be solved with quadratic programs (QP). We obtained worse results working in the different tests of above (toy example included) with the native QP solver of MatlabTM. For this reason, we adopt the QP solver LOQO ^{[}^{8}^{]}. LOQO was developed at the Princeton University NJ, and thanks to that it is pre-compile, the response times were notoriously improved. For instance, the toy example was performed using LOQO in 12.8480 seconds with an average of 0.1810 seconds per iteration.

While using the QP solver of MatlabTM it employed 3 minutes 18 seconds with an average of 12.4980 seconds per iteration. It was the main reason why we adopt LOQO as QP solver in order to perform the Text classification tasks

On

No. Examples
Precision
Recall
Time Employed (Mins)
Avg. time/Iteration (Secs)
No. Iterations
600
93,6667
98,6667
26,0608
15,1812
1.00329
17
300
66,3333
96,3333
6,3590
2,6140
1.00000
17
150
72,3333
91,3333
2,4931
0,4600
1.00000
17
75
79,3333
93,6667
2,1069
0,1000
1.00000
17
50
87,3333
90,3333
2,2259
0,0300
1.00000
17
25
89,6667
90,3333
3,3831
0,0100
1.00000
17

The improvements in the response times can be explained due to the fact that we are performing less computation in the construction of the linear Kernel

Splitting the test examples we can obtain better results in response times, but it means also a sacrifice of precision/recall in the entire task. It can be overcome if we divide considerably the test set. So we have to find the trade-off between precision/recall and response time.

The SISTA database corresponds to the model that describes the ESAT library. This database has been built by the text mining group of SISTA (Research division of the department of Electrical Engineering (ESAT) at the KU Leuven) ^{[}^{1}^{]}. This model includes a feature vector representation of each book as well as various database tables that allow the research process in text-mining.

In order to apply the TSVM algorithm to the SISTA library, the entire collections of books (1429 Titles) are loaded to the MatlabTM environment. In the first test, we worked with the table ix_library_v1 that have the indexed data including stop words ^{[}^{1}^{]}.

For the creation of the training examples 10 records were labelled as positives, and they were selected among the books that have the word “Control” in his title. On the other hand, 11 examples that have the word “Learning” were selected as negatives examples. We then evaluated the TSVM algorithm, looking for the classification of the 1408 test examples based solely on the 21 examples of training and the position of the test examples.

The TSVM algorithm took 3 hours (201.99 minutes) in solving the classification task. This process was run on a computer with Intel Core 2 Duo processor of 2 GHz with 2 GB of RAM. Which make it unreasonable to implement in a Web search engine, where the response times are expected to be minimum.

In order to verify the results, we integrate the tool in the intelligent interface web engine. In such interface, we typed the word “Control” trying to find the books in the test examples that have this word in his title. 213 titles are retrieved (

Evaluating the results of the classifier, the web interface allows searching within the positive classified examples. We performed the same search of above and 204 of the total 213 books were retrieved.

Apparently, 9 books were labelled as false negatives for the classifier. Now, analyzing this book titles we see that the field of “Description Quality” not has any information for 7 of the 9 titles.

A missing quality label means that a review was not found on the internet by the web crawler or because Amazon.com did not have a review available for the book (this is the case for 207 books, e.g. because it is a Dutch book) ^{[}^{1}^{]}.

The other two examples have been scored with “4” meaning that the review is no reliable (see above); in fact, one of the books is in Dutch what excludes it of any possible classification. The remaining book has the title: “Evolutionary Learning Algorithms for Neural Adaptive Control” Indeed this book fall more in the negative class where 10 books with the topic “learning” were selected as training examples.

These results show interesting clusters found within the SISTA library. The above experiment proves that the algorithm performs good classification of the books with this small quantity of examples. The only drawback still being the response time that continues making impractical the application.

No. Example
Positives
Negatives
Diff W.R.T. 1408
Time Employed (Mins)
Avg. time/Block (Secs)
Avg time/ Iteration (Secs)
C/C*
No. Iterations
1408
704
704
0
201,99
201,99
20,51
1
17
500
642
766
288
2,3675
50,93
0,95
1
17
250
680
728
342
0,55
5,44
0,155
1
17
125
768
640
404
0,707
3,68
0,054
1
17

Outstanding changes are depicted in the results: the amount of examples that diverge with the original classification are counting in the “Diff W.R.T. 1408” column. On it we can see that the smaller the group of examples the greatest the difference in the results. The number of iterations as

The same example was run in the SVMlight ^{[}^{6}^{]}. The results show 670 examples classified as positives.

The interesting thing here is that there are 50 examples differently classified by this software. Searching in the classified positives examples for books with the word “Control” in his title, the retrieval results were different. This software only classified 139 examples with the word “Control” as positives examples.

The main advantage of this application is that it takes only 10 minutes in order to solve the problem. But still being impractical to implement in a web search engine. Further analysis on this data becomes difficult because we do not have model examples of classification.

A copy of the code can be found in ^{[}^{9}^{]}.

When we look back at the initial goal of the project, we think we can say that we succeeded in meeting the main requirement. We have implemented a kernel classifier based on the transductive inference methodology. We were able to implement the transductive induction by means of the TSVM algorithm of Joachims ^{[}^{2}^{]}.

The fact of being able to classify all a set with few examples of training makes of this algorithm a solution for the huge task of organising and index the huge amounts of data available in the actual information society.

Thanks to the different types of examples with which we could evaluate the TSVM algorithm (see

One of the main advantages of TSVM is that it does not have to tune many parameters as traditional SVM. The kernel selection is straightforward because “non-linear SVMs do not provides any advantage for text classification using the standard kernels” ^{[}^{2}^{]}. The parameter of regularization “C” is calculated heuristically and taking into account the data of the particular task. This strategy proved to be effective. Finally, the number of examples assigned to the positive class (loop 1of the algorithm) in most of the cases corresponds to the half of the test data.

The only drawback found in the TSVM algorithm was the response time. In order to solve it, the splitting strategy was implemented. This solution solves the problem of the response time but sacrifices precision/recall. This is logical since the algorithm evaluates the totality of the test examples in order to define the final optimal classifier. If we train each of the TSVM on a smaller test set, we are not exploiting the 'Transductiveness' to the full extent. This reflects in a reduced performance.

Working with the SISTA database, the algorithm completed the task after 3 hours, which made impractical the integration done in the already developed Intelligent Interface Web Engine ^{[}^{1}^{]} from the SISTA group. A search engine on Internet demand reasonable response times and a delay of 3 hours is unthinkable in a web system. Even for the most patient of the users.

We recommend for further implementations try to use a parallelized strategy, in order to optimize the multiples core that the Processors technology of nowadays offers. Once the algorithm is parallelized, it can also be evaluated in platforms as MathWorks Cloud which is powered by Amazon EC2 ® ^{[}^{10}^{]}.

Thanks also to Joachims Thorsten for providing the theoretical base for this project. This work was supported by the SISTA Group of the KUL ^{[1]}