Is it possible that you have groups of highly correlated attributes in your data?

I just ran SMOreg with default settings on the abalone data, on three disjoint subsets of the same size. The data has two groups of variables with fairly strong linear correlation. The models are shown below. You can see that the relative importance of diameter vs height is quite different in the three models.

weights (not support vectors):
 +       0.0123 * (normalized) Sex=M
 +       0.009  * (normalized) Sex=F
 -       0.0213 * (normalized) Sex=I
 +       0.0611 * (normalized) Length
 +       0.1574 * (normalized) Diameter
 +       0.2091 * (normalized) Height
 +       0.5624 * (normalized) Whole weight
 -       0.6316 * (normalized) Shucked weight
 -       0.1852 * (normalized) Viscera weight
 +       0.1859 * (normalized) Shell weight
 +       0.0448

weights (not support vectors):
 +       0.0111 * (normalized) Sex=M
 +       0.0039 * (normalized) Sex=F
 -       0.015  * (normalized) Sex=I
 +       0.0985 * (normalized) Length
 +       0.0947 * (normalized) Diameter
 +       0.3764 * (normalized) Height
 +       0.4837 * (normalized) Whole weight
 -       0.6663 * (normalized) Shucked weight
 -       0.1503 * (normalized) Viscera weight
 +       0.4093 * (normalized) Shell weight
 +       0.0689

weights (not support vectors):
 +       0.01   * (normalized) Sex=M
 +       0.0098 * (normalized) Sex=F
 -       0.0199 * (normalized) Sex=I
 +       0.0161 * (normalized) Length
 +       0.2478 * (normalized) Diameter
 +       0.1256 * (normalized) Height
 +       0.5993 * (normalized) Whole weight
 -       0.9519 * (normalized) Shucked weight
 -       0.1521 * (normalized) Viscera weight
 +       0.4655 * (normalized) Shell weight
 +       0.0748
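
The instability in the weights above is what one would expect from near-collinear predictors. Here is a minimal Python sketch of the effect on synthetic data, not on the abalone set. The data-generating setup, the noise levels, and the function names are all assumptions made for illustration, and ordinary least squares stands in for SMOreg. Two almost identical predictors are fitted on three disjoint subsets: the individual weights can swing widely from subset to subset, while their sum stays close to the true combined effect of 2.

```python
import random

def make_rows(n, rng):
    """Synthetic rows (x1, x2, y): x2 is nearly a copy of x1, y = x1 + x2 + noise."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = x1 + rng.gauss(0.0, 0.05)      # strongly correlated with x1
        y = x1 + x2 + rng.gauss(0.0, 0.5)   # true weights are (1, 1)
        rows.append((x1, x2, y))
    return rows

def fit_two_weights(rows):
    """Least-squares weights (w1, w2) for y ~ w1*x1 + w2*x2 (no intercept)."""
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    s1y = sum(x1 * y for x1, _, y in rows)
    s2y = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12             # small when x1 and x2 are collinear
    w1 = (s1y * s22 - s2y * s12) / det      # Cramer's rule on the normal equations
    w2 = (s2y * s11 - s1y * s12) / det
    return w1, w2

def weights_per_subset(n_per_subset=100, n_subsets=3, seed=1):
    """Fit the model separately on disjoint subsets of one synthetic dataset."""
    rng = random.Random(seed)
    rows = make_rows(n_per_subset * n_subsets, rng)
    subsets = [rows[i::n_subsets] for i in range(n_subsets)]
    return [fit_two_weights(subset) for subset in subsets]

if __name__ == "__main__":
    for w1, w2 in weights_per_subset():
        print(f"w1={w1:+.3f}  w2={w2:+.3f}  w1+w2={w1 + w2:+.3f}")
```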


Running your attribute selection set-up on the three subsets (but with a default value of C for SMOreg), I get:

Ranked attributes:
 0.652   8 Shell weight
 0.707   6 Shucked weight
 0.739   3 Diameter
 0.748   5 Whole weight
 0.75    4 Height
 0.753   7 Viscera weight
 0.756   1 Sex
 0.754   2 Length

Ranked attributes:
 0.649   8 Shell weight
 0.702   6 Shucked weight
 0.714   1 Sex
 0.719   3 Diameter
 0.722   5 Whole weight
 0.723   7 Viscera weight
 0.723   2 Length
 0.711   4 Height


Ranked attributes:
 0.611   8 Shell weight
 0.652   6 Shucked weight
 0.679   4 Height
 0.69    1 Sex
 0.694   5 Whole weight
 0.698   7 Viscera weight
 0.699   3 Diameter
 0.698   2 Length

One way to get a more reliable picture of the importance of attributes is to use "Cross-validation" as the "Attribute selection mode" in the attribute selection panel. This will show the average rank of attributes across multiple runs on different subsets of the data.
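
The averaging idea can be sketched in Python using the three ranked lists above. This is only an illustration: WEKA's cross-validation mode averages over cross-validation folds rather than over these three disjoint subsets, and the function name average_ranks is made up for this sketch.

```python
from collections import defaultdict

# Attribute order from the three "Ranked attributes" outputs above (best first).
rankings = [
    ["Shell weight", "Shucked weight", "Diameter", "Whole weight",
     "Height", "Viscera weight", "Sex", "Length"],
    ["Shell weight", "Shucked weight", "Sex", "Diameter",
     "Whole weight", "Viscera weight", "Length", "Height"],
    ["Shell weight", "Shucked weight", "Height", "Sex",
     "Whole weight", "Viscera weight", "Diameter", "Length"],
]

def average_ranks(rankings):
    """Return {attribute: mean 1-based rank across all rankings}."""
    totals = defaultdict(float)
    for ranking in rankings:
        for position, name in enumerate(ranking, start=1):
            totals[name] += position
    return {name: total / len(rankings) for name, total in totals.items()}

avg = average_ranks(rankings)
# Print attributes from best (lowest) to worst average rank.
for name, rank in sorted(avg.items(), key=lambda item: item[1]):
    print(f"{rank:5.2f}  {name}")
```

Shell weight and Shucked weight come first and second in all three lists, whereas Diameter, Height, and Sex move around considerably, which is the pattern an average rank makes visible.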

Considering feature selection with SVMs in particular, you may also want to take a look at SVMAttributeEval, which implements the recursive feature elimination algorithm for ranking attributes.
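
To give a rough idea of how recursive feature elimination works, here is a minimal Python sketch. It is not SVMAttributeEval: a plain least-squares linear model stands in for the SVM, the toy data is invented, and all function names are assumptions for illustration. The loop repeatedly fits the model, drops the attribute with the smallest squared weight, and refits on the remaining attributes.

```python
def solve(a, b):
    """Solve the linear system a * w = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares(xs, ys, cols):
    """Least-squares weights (no intercept) using only the attribute indices in cols."""
    a = [[sum(x[i] * x[j] for x in xs) for j in cols] for i in cols]
    b = [sum(x[i] * y for x, y in zip(xs, ys)) for i in cols]
    return solve(a, b)

def rfe_ranking(xs, ys):
    """Return attribute indices ordered from most to least important."""
    remaining = list(range(len(xs[0])))
    eliminated = []
    while remaining:
        weights = least_squares(xs, ys, remaining)
        # Drop the attribute whose squared weight is smallest, then refit.
        worst = min(range(len(remaining)), key=lambda k: weights[k] ** 2)
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]

# Toy data (invented): y depends strongly on x0, weakly on x1, not at all on x2.
xs = [(a, b, c) for a in (-1, 1) for b in (-1, 1) for c in (-1, 1)]
ys = [3 * x[0] + 1 * x[1] for x in xs]
ranking = rfe_ranking(xs, ys)
print(ranking)
```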

To compute the correlation for individual predictors, you can use ClassifierAttributeEval.
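
As a rough illustration of what a per-attribute correlation looks like, here is a minimal Python sketch that scores each attribute by the absolute Pearson correlation with the target. It does not call WEKA and does not reproduce ClassifierAttributeEval; the function names and the toy data are assumptions for illustration. For a one-attribute linear model with an intercept, the correlation between predictions and target equals this absolute value.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_per_attribute(columns, target):
    """columns: {name: values}. Returns (|r|, name) pairs, strongest first."""
    scores = [(abs(pearson(values, target)), name)
              for name, values in columns.items()]
    return sorted(scores, reverse=True)

# Toy data (invented): attribute "a" tracks the target, "b" only weakly.
columns = {"a": [1, 2, 3, 4], "b": [1, -1, 1, -1]}
target = [2, 4, 6, 8]
scores = correlation_per_attribute(columns, target)
for score, name in scores:
    print(f"{score:.3f}  {name}")
```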

Cheers,
Eibe

On Mon, Dec 18, 2017 at 10:13 PM, Ronan Flynn <rflynn@ait.ie> wrote:

Hello Eibe,

Thanks for your reply. Yes, the three subsets are selected at random from the overall dataset.

Regards,
Ronan

 

Date: Sat, 16 Dec 2017 18:42:36 +1300
From: Eibe Frank <eibe@waikato.ac.nz>
To: Weka machine learning workbench list. <wekalist@list.waikato.ac.nz>
Subject: Re: [Wekalist] Select attributes - why are the rankings different for subsets of the same dataset?

 

Have you shuffled the data before you created the three subsets? The Randomize filter in WEKA can be used for that. Alternatively, you can use the RemoveFolds filter (configuring it for a three-fold cross-validation).
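
The shuffle-then-split step can be sketched in a few lines of Python. This is not the implementation of WEKA's Randomize or RemoveFolds filters; the function name and the seed are assumptions for illustration. The point is that shuffling first keeps any ordering in the original file from ending up concentrated in a single subset.

```python
import random

def shuffled_folds(instances, n_folds=3, seed=42):
    """Shuffle a copy of the data, then deal it into n_folds disjoint subsets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    return [shuffled[i::n_folds] for i in range(n_folds)]

folds = shuffled_folds(range(10))
for index, fold in enumerate(folds, start=1):
    print(f"fold {index}: {len(fold)} instances")
```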

 

Cheers,
Eibe

 

From: Ronan Flynn
Sent: Saturday, 16 December 2017 12:50 AM
To: wekalist@list.waikato.ac.nz
Subject: [Wekalist] Select attributes - why are the rankings different for subsets of the same dataset?

Hello All,

 

I have a speech dataset that is divided into three subsets. There are approximately 90 attributes and the target is a numerical correlation value. I want to rank the attributes and have used the following:

 

Evaluator: weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.functions.SMOreg -F 5 -T 0.01 -R 1 -E CORR-COEFF -- -C 0.0302 -N 0 -I "weka.classifiers.functions.supportVector.RegSMOImproved -T 0.001 -V -P 1.0E-12 -L 0.001 -W 1" -K "weka.classifiers.functions.supportVector.PolyKernel -E 1.0 -C 250007"

Search: weka.attributeSelection.GreedyStepwise -R -T -1.7976931348623157E308 -N -1 -num-slots 1

 

When I run the attribute selection on each of the three speech subsets I get three very different ranked lists. I would have expected the rankings for the three subsets to be similar given that they are taken from the same overall speech dataset. Can anyone suggest possible reasons as to why the rankings are so different for each of the three speech subsets?

 

Also, is it possible when doing the ranking to output the correlation for each attribute individually? I would like to see the correlation for the individual attributes.

 

Regards and thanks,
Ronan Flynn

 

 

 
