I have implemented a Bayesian network with the Weka API. I have tested it
with the iris test data that come with Weka, and it works well.
The goal of my project is to feed data to this network and predict whether
each entry is valid or non-valid. I don't have access to any real data yet, so I
am using a script to generate about 100,000 entries. I load the data with the
JSON loader included in Weka, and this works without any issues.
The data structure is basically:
someData1, someData2, someData3 and class.
The three data attributes are numeric, and the class is nominal, with two
possible labels: "valid" and "non-valid".
What I want to return from a prediction is the probability that an entry
is considered valid. I do this by returning the value for the "valid" label
from the predicted probability distribution. Up to this point everything works
technically: I do get a probability out of the system. But here is the issue:
This probability is always either very close to 1 (> 0.99) or very close
to 0 (< 0.001).
What I want is a more realistic distribution, with probabilities that can
fall anywhere in the range from 0 to 1.
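
To make the prediction step concrete, here is a minimal sketch of reading the probability of the "valid" label out of a class distribution. It is an illustration under assumptions, not the project's actual code: the class name ValidProbabilitySketch, the method probabilityOf and the example numbers are made up, and the sketch uses plain Java so that it runs without Weka on the classpath. In Weka, the array would normally come from Classifier.distributionForInstance(Instance), and the index from classAttribute().indexOfValue("valid").

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Weka): picks the probability of one class
// label out of a per-class probability array.
final class ValidProbabilitySketch {

    // labels:       class labels in the order declared in the data header,
    //               e.g. ["valid", "non-valid"].
    // distribution: one probability per label, in the same order. With Weka this
    //               would be the result of classifier.distributionForInstance(instance).
    static double probabilityOf(String label, List<String> labels, double[] distribution) {
        if (labels.size() != distribution.length) {
            throw new IllegalArgumentException("labels and distribution differ in length");
        }
        // Weka equivalent: data.classAttribute().indexOfValue(label)
        int index = labels.indexOf(label);
        if (index < 0) {
            throw new IllegalArgumentException("unknown class label: " + label);
        }
        return distribution[index];
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("valid", "non-valid");
        double[] distribution = {0.995, 0.005}; // made-up numbers for illustration
        System.out.println(probabilityOf("valid", labels, distribution));
    }
}
```

Looking the index up by label name, rather than hard-coding 0 or 1, keeps the result correct even if the order of the labels in the header changes.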
Does anyone have a suggestion how I can improve the prediction in
The data generation code can be found here:
NB: the data generated by the code above is later pasted into a JSON
file that includes the required header section.
Jan M. Ørstavik