On Jun 10, 2020, at 5:26 PM, Michael Hall <mik3hall@gmail.com> wrote:



On Jun 10, 2020, at 3:26 PM, Bill Bane <billfbane@gmail.com> wrote:

Hi -- the attached little write-up may help describe how linear regression
can be performed on nonlinear data.  Of course we need to be careful about
overfitting or extrapolating models that use higher-order terms like this, but
for well-contained data sets it can work satisfactorily.
Cubic_regression_example.pdf
<https://weka.8497.n7.nabble.com/file/t5855/Cubic_regression_example.pdf>  

For reference, here is the synthetic data:
Polynomial.csv <https://weka.8497.n7.nabble.com/file/t5855/Polynomial.csv>  
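If you'd rather do this in code than through the Explorer, the same idea looks roughly like the sketch below. It assumes the CSV has X as the first column and Y as the second, and uses Weka's AddExpression filter to generate the squared and cubed terms; names like x_squared are just placeholders.

import java.io.File;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddExpression;

public class CubicFit {
    public static void main(String[] args) throws Exception {
        // Load the synthetic data; assumes attribute 1 is X, attribute 2 is Y.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("Polynomial.csv"));
        Instances data = loader.getDataSet();

        // Append X^2 and X^3 as extra attributes so a plain linear
        // regression can fit the cubic shape.
        AddExpression sq = new AddExpression();
        sq.setExpression("a1^2");   // a1 refers to the first attribute (X)
        sq.setName("x_squared");
        sq.setInputFormat(data);
        data = Filter.useFilter(data, sq);

        AddExpression cu = new AddExpression();
        cu.setExpression("a1^3");
        cu.setName("x_cubed");
        cu.setInputFormat(data);
        data = Filter.useFilter(data, cu);

        // Y is still the second attribute; make it the class.
        data.setClassIndex(1);

        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);
        System.out.println(lr);
    }
}

LinearRegression just treats x_squared and x_cubed as two more inputs, which is all the cubic fit really is.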


Fwiw, I posted my current data. 
http://mikehall.pairserver.com/default.csv
http://mikehall.pairserver.com/test.csv

This follows up on something earlier where I thought GraalVM improved Weka memory management, but it turned out just to be different GC settings.
I thought you could come up with a tool to tune GC along the lines of what I had already been doing: just keep increasing RandomForest iterations until you run out of memory. The settings that allow more iterations might offer improved memory management, although I don't really have anything to prove that would generalize to other classifiers and their parameters.
I also had the code record information about memory and garbage collection, either to a CSV or an ARFF file.
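For what it's worth, the probing loop is roughly the sketch below, not my exact code: it assumes a recent Weka where RandomForest has setNumIterations (older versions call it setNumTrees), it just prints rather than writing the CSV/ARFF, and the old/young split on collector names is approximate.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;

public class GCProbe {
    // Keep raising the iteration count until the JVM gives up,
    // logging elapsed time and GC activity after each build.
    public static void probe(Instances train, int step) throws Exception {
        train.setClassIndex(train.numAttributes() - 1);
        for (int iters = step; ; iters += step) {
            RandomForest rf = new RandomForest();
            rf.setNumIterations(iters);
            long start = System.nanoTime();
            try {
                rf.buildClassifier(train);
            } catch (OutOfMemoryError oom) {
                System.out.println("OOM at " + iters + " iterations");
                return;
            }
            long elapsed = (System.nanoTime() - start) / 1_000_000L;
            long oldCount = 0, youngCount = 0;
            for (GarbageCollectorMXBean gc :
                    ManagementFactory.getGarbageCollectorMXBeans()) {
                // Collector names vary by GC; this old/young split is rough.
                String name = gc.getName().toLowerCase();
                if (name.contains("old") || name.contains("marksweep")) {
                    oldCount += gc.getCollectionCount();
                } else {
                    youngCount += gc.getCollectionCount();
                }
            }
            // iteration, elapsed (ms), young_count, old_count
            System.out.println(iters + "," + elapsed + ","
                    + youngCount + "," + oldCount);
        }
    }
}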
default.csv is from command-line invocations with no GC parameters. It ran out of memory doing RandomForest at about 6000 iterations.
test.csv is the current run with different GC parameters. It made it to 7000 iterations.
So this code alone somewhat serves the original purpose: it can, to some extent, indicate how well GC is working.
However, I noticed that in either case elapsed time seemed to increase with iterations in a very linear way as long as there was free memory. When free memory ran out, things went nonlinear as GC tried to manage things on its own. The nonlinear part still looked like it might follow a fairly well-formed exponential-type curve. I wondered if that could be modeled.
To that end, if interested, you could look at either
X = iteration, Y = elapsed
or
X = iteration, Y = old_count
When things go nonlinear, most of the action starts occurring with GC in the old-gen memory pool.
You can see in the current dataset that I did some extra runs to fill in the nonlinear part a little; the code sorts the instances by iteration to allow for this.
The analysis code removes all attributes from the instances except the chosen X and Y.
It also makes sure the attribute of interest (e.g. elapsed or old_count) is strictly increasing: the last run can hit an out-of-memory error early, so it eliminates instances from the back where that isn't the case.
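That preprocessing amounts to something like this sketch; xIdx and yIdx are the zero-based positions of whichever X and Y you picked, and it assumes X comes before Y so X ends up first after the Remove.

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class Prep {
    // Keep only the X and Y columns, sort by X, then drop trailing
    // instances where Y stops strictly increasing (e.g. a run that
    // died early with an OutOfMemoryError).
    static Instances prepare(Instances data, int xIdx, int yIdx) throws Exception {
        Remove keep = new Remove();
        keep.setAttributeIndicesArray(new int[] { xIdx, yIdx });
        keep.setInvertSelection(true); // keep these two, remove the rest
        keep.setInputFormat(data);
        Instances xy = Filter.useFilter(data, keep);

        xy.sort(0); // X is now the first attribute
        while (xy.numInstances() > 1
                && xy.instance(xy.numInstances() - 1).value(1)
                   <= xy.instance(xy.numInstances() - 2).value(1)) {
            xy.delete(xy.numInstances() - 1);
        }
        return xy;
    }
}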
The code then tries to determine the nonlinear break. It removes an instance from the back and checks whether that improves linearity; if it does, it adds the instance to a separate nonlinear Instances, and it repeats until removing no longer improves linearity.
Then we have our nonlinear instances ready for modeling.
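The break-finding loop is roughly the following. One assumption to flag: I'm using the training-set correlation coefficient from Weka's Evaluation as the "linearity" measure; any goodness-of-fit number would slot in the same way.

import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;

public class BreakFinder {
    // Peel instances off the back while doing so improves the linear
    // fit on what remains; the peeled-off tail is the nonlinear set.
    static Instances splitNonlinear(Instances xy) throws Exception {
        xy.setClassIndex(1); // Y is the class
        Instances nonlinear = new Instances(xy, 0); // empty, same header
        double best = correlation(xy);
        while (xy.numInstances() > 2) {
            Instances trial = new Instances(xy);
            trial.delete(trial.numInstances() - 1);
            double r = correlation(trial);
            if (r <= best) break; // removing no longer improves linearity
            nonlinear.add(xy.instance(xy.numInstances() - 1));
            xy.delete(xy.numInstances() - 1);
            best = r;
        }
        nonlinear.sort(0); // restore ascending X order
        return nonlinear;
    }

    static double correlation(Instances d) throws Exception {
        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(d);
        Evaluation eval = new Evaluation(d);
        eval.evaluateModel(lr, d);
        return eval.correlationCoefficient();
    }
}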

For a second use I am considering a version that trains any given classifier that can handle a dataset on increasingly large splits of it, then sees whether at some point things go nonlinear and models the complexity, to get an idea of how well different classifiers scale with increasing data. A sketch is below.
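Sketched out, that might look like the following; classifierName is any Weka classifier class name, the class attribute is assumed to be last, and timing just the buildClassifier call is my working definition of "scaling" here.

import weka.classifiers.AbstractClassifier;
import weka.classifiers.Classifier;
import weka.core.Instances;

public class ScalingProbe {
    // Train on progressively larger prefixes of a shuffled dataset and
    // record build time, so time vs. data size can be modeled the same
    // way as iterations vs. elapsed above.
    static void scalingProbe(Instances data, String classifierName) throws Exception {
        data.randomize(new java.util.Random(1)); // avoid ordered-data artifacts
        data.setClassIndex(data.numAttributes() - 1);
        int step = Math.max(1, data.numInstances() / 10);
        for (int n = step; n <= data.numInstances(); n += step) {
            Instances split = new Instances(data, 0, n); // first n instances
            Classifier c = AbstractClassifier.forName(classifierName, new String[0]);
            long start = System.nanoTime();
            c.buildClassifier(split);
            long ms = (System.nanoTime() - start) / 1_000_000L;
            System.out.println(n + "," + ms); // instances, build time (ms)
        }
    }
}

For example, scalingProbe(data, "weka.classifiers.trees.RandomForest") would give a rough size-vs-time curve for RandomForest.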
If I finish this, I mean at some point to put together something that explains all of this more clearly and looks a little better.
The visualizations you had were nice. I wasn’t aware you could do some of those with Weka.