## Monday, June 29, 2009

I seem to have slowed down after my initial daily and even twice-daily blog posts. So now I'll try to summarize what I've been doing for the past week.

I'm pretty sure I spent most of the time simulating perfect Gaussians and then running Nmix on the data to find the single-component confidence level. I repeated this 100 times and found the average value, then repeated the whole process for different numbers of stars to see just how many are necessary for Nmix to be reasonably accurate. The results weren't too satisfying: it seems that we need around 175 stars for Nmix to reach 70% or so single-component confidence.

This is unfortunate, as most of the samples that I have seen only have 20-60 stars, well under the threshold at which Nmix can actually determine that there is only a single component.

I then repeated this, adding in some randomly generated error to the simulated Gaussian, to see if this changed its ability to detect shape. This had practically no effect on the Nmix results, which makes some sort of sense.
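My simulation code is in IDL, but the setup is simple enough to sketch in Python (the mean, dispersion, and error sizes below are placeholders, and the Nmix run itself happens outside this snippet):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_velocities(n_stars, mu=0.0, sigma=10.0, mean_err=2.0):
    """Draw n_stars velocities from a perfect Gaussian, then perturb each
    one by its own randomly generated measurement error (placeholder values)."""
    true_v = rng.normal(mu, sigma, n_stars)
    errs = rng.uniform(0.5, 2 * mean_err, n_stars)   # per-star uncertainties
    observed = true_v + rng.normal(0.0, errs)        # scatter by the errors
    return observed, errs

# 100 trials at each sample size, as in the tests described above
for n in (25, 50, 100, 175):
    samples = [simulate_velocities(n)[0] for _ in range(100)]
    # each sample would then be written out and fed to Nmix
```

When the errors are small compared to the dispersion, adding the per-star scatter barely changes the shape of the distribution, which is probably why it had so little effect on the Nmix results.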

On Wednesday, Marla Geha was visiting Beth and she sent me the velocity distributions for 8 other dwarf galaxies (after several reminders). I'll do something with them once I finish with the simulated data.

My newest project was to simulate a sample composed of two Gaussians to see if Nmix actually works the way it should in detecting multiple structures. I placed the two Gaussians one standard deviation apart. This time it was even less accurate than with the single Gaussian at finding the correct structure. I managed to stupidly delete a bunch of that data, but there was only about 40% confidence for 2 components with under 200 stars. With 1000 stars there was on average about 70% confidence, which indicates that it did work. But we don't have nearly that many stars in our samples so it's pretty discouraging.
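For reference, the two-Gaussian samples were built something like this (again a Python sketch of the IDL procedure; the mean and dispersion values are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_double_gaussian(n_stars, mu=0.0, sigma=10.0, separation=1.0):
    """Simulate a 50/50 mixture of two Gaussians whose means sit
    `separation` standard deviations apart (values are illustrative)."""
    half = n_stars // 2
    comp1 = rng.normal(mu, sigma, half)
    comp2 = rng.normal(mu + separation * sigma, sigma, n_stars - half)
    return np.concatenate([comp1, comp2])
```

With the means only one standard deviation apart, the combined histogram has a single broad peak rather than two distinct ones, which is probably why Nmix needs so many stars to tell the components apart.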

I also started testing the kurtosis of various sampled data and the actual samples and will be attempting to figure out how to effectively use that in the near future (or the next time the power doesn't randomly shut off in the middle of my program).
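The kurtosis test itself is just a moment calculation; in Python it would look like this (a sketch of what my IDL code does):

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: ~0 for a Gaussian, negative for flat-topped
    distributions (like a blend of two offset Gaussians), positive for
    heavy tails."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    m2 = ((x - m) ** 2).mean()
    m4 = ((x - m) ** 4).mean()
    return m4 / m2 ** 2 - 3.0
```

A 50/50 blend of two Gaussians with means at ±a around the center has excess kurtosis -2a⁴/(σ² + a²)², so a negative value is a hint of multiple populations.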

To Do:

1) Add kurtosis to double Gaussian test

2) Read Simon and Geha 2007

3) Do the ratio stuff for double Gaussian

## Monday, June 22, 2009

### Jackknife has too many k's in a row for one word.

Since my last blog post, I have spent my time modifying my bootstrap code to instead do jackknifes.

Jackknifes are somewhat similar to bootstraps in that they test modified data sets, but the jackknife alteration is simply to remove a single data point, compute the statistic in question, put it back, remove another, and repeat until every data point has been left out once. This gives a good estimate of the variance of the statistic, because it shows how much any single data point affects the result.
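In code, the whole jackknife loop is only a few lines (a Python sketch; my actual version is in IDL):

```python
import numpy as np

def jackknife(data, statistic):
    """Leave-one-out jackknife: recompute `statistic` with each point
    removed once, and return the estimates plus the jackknife variance."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    estimates = np.array([statistic(np.delete(data, i)) for i in range(n)])
    var = (n - 1) / n * np.sum((estimates - estimates.mean()) ** 2)
    return estimates, var
```

A handy sanity check: for the mean, the jackknife variance works out to exactly s²/n, the usual standard-error formula.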

I'll find out more about jackknifing and bootstrapping once I actually get that book from the Astro library (probably tomorrow).

My jackknifing code made a bunch of pretty plots showing the Nmix probability of a certain component for each test. It's pretty interesting that the single-component probability seems to fluctuate a lot for the different jackknifes but quickly converges at 2 components and beyond.

But an alarming discovery was that the simulated normal sample with added uncertainty sometimes appeared to be strongly bimodal. So I decided to test the power of Nmix with a bunch of simulated pure normal populations of varying star number.

Theoretically Nmix should work better with more stars (as more data points make the Normal curve much more visible), so I'll see if that pans out as expected.

To Do:

1) Find the mean and standard deviation of the kurtosis of the jackknifed samples

2) LEARN

3) According to Wikipedia, there is also a kick called a jackknife:

4) Figure out how to do that.


## Thursday, June 18, 2009

### Rage Against the Machine.

Well the first thing that I learned today was that to do a good bootstrapping estimation, you must bootstrap n*(log n)^2 times. Then you can find a good estimate of the true population variance as well as the variance of the statistic in question.
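For a 40-star sample that rule works out to about 40·(ln 40)² ≈ 540 resamples. The resampling itself is straightforward (a Python sketch; `statistic` is whatever quantity is being bootstrapped):

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_stderr(data, statistic, n_boot=None):
    """Resample with replacement n*(log n)^2 times (the rule quoted above)
    and return the spread of the statistic across the resamples."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    if n_boot is None:
        n_boot = int(n * math.log(n) ** 2)
    stats = np.array([statistic(rng.choice(data, size=n, replace=True))
                      for _ in range(n_boot)])
    return stats.std(ddof=1)
```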

Apparently the book I was reading about bootstrapping online is in the Astro Library. And the print version hopefully won't have the Google Book Preview's random missing pages. So I'll head over there and check it out as soon as it stops raining, which is starting to seem like never.

The rest of the day was spent obnoxiously debugging my code, which randomly refused to make a FITS file of the data I wanted.


## Wednesday, June 17, 2009

### Giving Bootes II the Boot

Yesterday I slightly refined my code to also create a summary .pe file. This file includes the k's as well as the weights, means, and variances of the predicted components. Thus, when it detects two populations, it's pretty easy to see whether the second Gaussian has only a small fraction of the total number of stars. If it does, it's probably caused by an outlier and shouldn't be viewed as significant.

Most of today was spent attempting to extract the actual Bootes II data from the data set. This was complicated by the fact that (as I realized after some time) some of the data points were duplicates, and some were just ridiculous and shouldn't even be counted. But with Beth's help I was finally able to narrow the 243 stars down to the 21 that are actually supposed to be in the galaxy. I then did my typical bootstrap/Nmix analysis.

But first I ran Nmix on the actual data, with the following results:

**Bootes II Nmix Results**

1) 21.3% confidence in 1 component, with mean of -122.9

2) 18.9% confidence in a 2 component fit, with means of -128.5 and -114.58

3) 14.2% confidence in 3 components

4) 10.4% confidence in 4 components

So these results are rather ambiguous, with no really obvious number of components. But that is probably a result of the small sample size of 21 stars. This is the least confident of all of the samples I have tested so far.

**Bootes II Bootstrap Nmix Results**

1) Only 15% had above 10% confidence for a single population, 5% above 20%

2) 25% of both the 2 and 3 component fits had a confidence above 10%

3) 10% had a 2 component fit with confidence above 20%, 5% of the 3's

4) However, in one of the 2 component best-fits where there was no chance of a single peak, 83% were in one of the peaks. Not sure where to make the cutoff that says the other peak is insignificant.

Overall, this was ridiculously inconclusive, but 2 components seems to win out.

To Do:

1) Learn more about bootstrapping/jack-knifing

2) Make a code that will read the Nmix data, then determine if the multi-component fit is significant or not based on the relative weights

3) Kurtosis


## Monday, June 15, 2009

### Monday Monday Monday

Today I ran Nmix 20 times on bootstrapped Segue I data.

But again the results were rather discouraging. Running Nmix on the full data set yielded a single population probability of 70%. Yet the bootstrapped samples never came close to that probability, having an average single population probability of 10%:

7/20 (35%) had a single-population prob. of over 10%.

Only 3 of those (15%) had a prob. of over 20%.

Yet the result was still way better than the others, as only 7 (35%) had almost no discernible single- or double-component structure with over 10% prob. And only 3 of those had a probability of 3 components less than 10%.

So 17/20 had at least a decent confidence for up to 3 components, even if the results weren't as overwhelmingly single population as with the actual data.

Next I generated a normally distributed sample of velocities based on the Segue I data and bootstrapped and Nmixed it as I have with the real data sets. The results seemed to show that Nmix usually works pretty well. The average confidence level for 1 component was 40%, which is way higher than that for the actual data. This of course makes sense, as it is a true Gaussian distribution and should have a high prob. of a single component. But there were still 4/20, or 20% of the time when it didn't seem to detect the single group.

**Normalized Data with Added Uncertainty**

I then added in the uncertainties (excluding the data point with the ridiculous 147.99 km/s uncertainty) and bootstrapped and Nmixed once more. This time the average confidence level for 1 component was only 22%, still way higher than it was for the actual data. Yet it missed the single population (failing to reach 10% confidence) 60% of the time. But except for one case (5% of the time), it always detected a 2-component fit with a high probability when the one-component fit didn't work. This really makes me question the effectiveness of Nmix, as its results are either spot-on or a complete miss depending on the bootstrapped sample.

Looking at the graphs, the times when it detected 2 groups were when there was one point (or 2 points, if that point happened to be resampled in the bootstrap) located really far away due to its uncertainty; far enough away to not be considered part of the single population. So I guess Nmix did its job as well as it could given those outliers.

Measurement uncertainty sucks.


## Friday, June 12, 2009

### These Bootstraps Are Made For Walking

Today I ran Nmix (100000 sweeps, burn-in of 100000) on a bunch of bootstrapped samples of the Willman I velocity data, this time only including the 40 with lower metallicities.

I'll summarize the results of the 12 trials that I just recorded:

1) This one had almost no discernible structure (~90% for more than 2 components)

2) Again, no discernible structure (~90% for more than 2 components)

3) No structure (~96% for more than 2 components)

4) About 10% confidence for 1 component, 75% for more than 2

5) ABSOLUTELY ZERO STRUCTURE

6) Lotta structure. 21% for 1 component, 55% more than 2

7) Reasonably high confidence for 1 component: 46%, ~30% more than 2

8) No structure

9) Around 10% for 1 component, about 75% more than 2

10) 13% for 1 component, 20% for 2, about 65% for more than 2

11) No chance for 1 component, but about 30% for 2, another 20% for 3.

12) Nothing

__Summary__

1) 6 out of 12 of these runs, 50%, had no discernible structure.

2) 5 out of 12, 42%, had above 10% confidence for a single component

3) Only 2 of 12, 17%, supported a single component at more than 20% confidence

4) 4 of 12, 33%, supported two components at more than 20% confidence

**With the High-Metallicity Stars**

I then repeated the above with the extra 5 stars, with the following results:

1) Nothing

2) Nothing

3) 15% for 1 component, 18% for 2

4) 5% for 1, but 27% for 2 and 20% for 3

5) Nothing

6) Nothing

7) 10% for 1, and a whopping 30% for 2

8) Nothing of interest, but slightly more probable than those other nothings--5% for 1

9) 33% for 1, 26% for 2

10) Holy crap, 50% for 1, 24% for 2

11) Absolutely nothing for 1, 28% for 2

12) Nothing

__Summary__

1) Again, 50% had no discernible structure

2) Only 33% detected a single component at more than 10% confidence

3) Only 17% at more than 20%

4) 5 out of 12, 42%, had a double component composition at above 20% confidence

**Bootstrapping Hercules**

I then devised a code that would automate the Nmixing, reading the files, and then writing the first line of the .k file (the posterior probability of each k) to a new file.

I used this to bootstrap and Nmix Hercules 20 times, with rather surprising results.

Remember that Hercules only has 30 stars, but an Nmix on the actual data gives a 49% probability of a single component.

What was surprising was that only 20% (4 of 20) of the bootstrapped samples had higher than a 10% probability of a single component. 3 out of 4 of those positive results were above 35%, but it was still rather discouraging. This means that 80% of the bootstrapped samples had practically no discernible overriding structure.

I'm not sure if this means that there just aren't enough stars (30) for proper bootstrapping or that Nmix just sucks. I need to test it on Segue I, with its higher (70%) single component confidence level.

TO DO

1) Do the same stuff to Segue1, remembering to exclude the last stars that don't belong

2) Try messing around with Bootes II, making sure that the right stars are selected from the data (Member is 1)

3) KNOW MY BENS!


## Thursday, June 11, 2009

### Nmix Paragraph

The Nmix program, coded by Peter Green in 1997, performs Bayesian analysis of univariate normal mixtures, implementing the approach of Richardson and Green, Journal of the Royal Statistical Society B, 59, 731-792 (1997). It uses a reversible jump Markov chain Monte Carlo method, which gives the program the ability to determine the composition of a mixture with an unknown number of components, assuming that they are normally distributed. As it runs, it divides the data into various numbers of components, *k*, splitting or combining components based on calculated probabilities. At the end of the run-cycle it estimates the posterior probabilities for each *k*. These probabilities are a calculated likelihood for each k-component fit. It also determines the most likely means and variances for each of the components included in each *k*.

I'll add a To Do list to this post, since I don't have enough to write for a full post.

1) Write a code that generates a bunch of bootstrapped data files, Nmixes each one, then writes the relevant data to another file. Should be pretty complicated.

2) Decide if the Nmix failure on bootstrapped data is really that typical.

3) Think of a third thing for this To Do list.

## Wednesday, June 10, 2009

### Willman I vs. the Nmix

I don't feel like typing up each individual stat, as there are a lot for this galaxy. So I'll just copy and paste the relevant Nmix data from the .pe file (see previous posts to understand what the numbers mean):

```
1 173881
1.00000 -13.49319 6.00589
2 283751
0.56915 -17.32806 3.94248
0.43084 -8.08833 3.94228
3 198372
0.36418 -19.62466 3.61210
0.33942 -13.29231 3.86374
0.29639 -5.97242 3.56708
4 124760
0.24127 -21.75585 3.47407
0.29321 -15.86464 3.56630
0.24968 -10.76959 3.66620
0.21583 -4.12989 3.38790
```

What's interesting about this is that both the double and triple component fits are more probable than the single component fit. Of course, this is without taking metallicity into account, which could possibly remove the stars that are Milky Way contaminants.

But from looking at the histograms with the fits plotted, the two-component model looks pretty darn good. But that relies on my scaling of the curves by eye as I can't really figure out how to truly scale them.
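Actually, the scaling doesn't have to be done by eye: if the histogram holds N stars in bins of width Δv, a component with weight w should be drawn as N·Δv·w times the normal density. A Python sketch using the two-component Willman I fit above, with 45 stars and a hypothetical 2 km/s bin width:

```python
import numpy as np

def scaled_component(v_grid, weight, mu, sigma, n_stars, bin_width):
    """Gaussian component scaled to counts-per-bin so it overlays a
    histogram directly: N * binwidth * weight * normal pdf."""
    pdf = np.exp(-0.5 * ((v_grid - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return n_stars * bin_width * weight * pdf

# the two-component fit from the .pe output above (45 stars, 2 km/s bins)
v = np.linspace(-35, 10, 200)
curve = (scaled_component(v, 0.56915, -17.32806, 3.94248, 45, 2.0)
         + scaled_component(v, 0.43084, -8.08833, 3.94228, 45, 2.0))
```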

I did the same thing for the Willman I data without the high metallicity contaminant stars and got 21% for one component, 26% for two, and 18% for three. So it's pretty likely that there are multiple populations in there.

TO DO:

1) Write a paragraph about Nmix--what type of statistical grouping test it is

2) Bootstrap the Willman I data--take a data set of 40/45 (40 if we don't include the high metallicity stars) of the stars with replacement, then Nmix that, record results

3) Find some way to program the above into IDL

4) ?????

5) PROFIT!


## Tuesday, June 9, 2009

### Nmix and Match

Today I decided to use Nmix on the velocity data for Segue 1 (1 million sweeps, burn-in of 1 million), giving me the third set of data in the Segue Folder.

This gave me a 68.74% probability of one component with a mean of 207.85 and standard deviation of 8.29 (I still need to figure out how to find the errors in the mean).

But there is a 17.38% probability of two components, the first composing 46.6% of the stars with a mean of 201.29 and sd of 7.82, and the second has 53.4% of the stars with a mean of 212.58 and sd of 7.54.

I made a graph of the histogram with these Gaussians plotted on it, but am currently failing at getting it onto the internet.

**Hercules!**

I then Nmix'd the Hercules data:

There is a 48.78% probability of one component with a mean of 45.35 and sd of 6.33.

There is also a 20.36% chance of two components, the first with 51.0% of the stars with a mean of 40.94 and sd of 5.46, the second with 49.0% of the stars with a mean of 50.20 and sd of 6.00.

There is still a 10.55% chance of three components, although it doesn't really work based on the histogram.

I think the high percentage of multiple groups is just because there are only 30 stars in the file and there isn't really that much to group.

But I still made a bunch of sweet and COLORFUL graphs that show the fits on the histograms. They're pretty cool.

I'll even throw in a To-Do List because they seem to be the newest rage among my fellow researchers.

TO DO:

1) Nmix Willman I and Bootes II and make cool plots.

2) Figure out how to find the error in the means given by Nmix.

3) Actually read more of the stat books I got from the library.

4) Discover why Coke Zero is so much tastier than Diet Coke, despite having zero calories. It's a mystery!


### Deciphering Nmix

So I finally have a working version of Nmix, the program that uses the Reversible Jump Markov Chain Monte Carlo method to determine the number of components in some data set.

As far as I can figure out, the k file contains information about the number of components, k.

The first block is the posterior probabilities for each k, with the most probable obviously being the most likely number of components.

The next block contains the Bayes factors, which I don't yet understand.

The next is the prior distribution of k, which defaults to 1 for all k.

The last block is the posterior distribution of kpos, the number of nonempty components. This is pretty similar to the distribution for k, and I'm not entirely sure how they differ.

The pe file contains an even better summary of the data. It has a section for each k (the number of components) that the program visits.

The first number is the number of visits to that k or something. But it directly corresponds to the probability distribution in the k file (that number over the total number of sweeps).

Then we get to the interesting stuff. The first column is the weight of each component (the percent of data belonging to each).

The next column is the mean of each component, and the last is the standard deviation.

But that doesn't include the error in the mean and I'll need to delve a little deeper to find it.
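Pieced together, the .pe layout described above is easy to parse automatically (a Python sketch; the exact format is inferred from the excerpt in the Willman I post, and the 1,000,000 sweep count is an assumption):

```python
def parse_pe(text, total_sweeps):
    """Parse the .pe summary: for each visited k, a header line
    '<k> <visits>' followed by k lines of '<weight> <mean> <sd>'.
    Probability of each k is visits / total sweeps, as described above."""
    lines = [l.split() for l in text.strip().splitlines() if l.strip()]
    fits, i = {}, 0
    while i < len(lines):
        k, visits = int(lines[i][0]), int(lines[i][1])
        comps = [tuple(map(float, lines[i + 1 + j])) for j in range(k)]
        fits[k] = {"prob": visits / total_sweeps,
                   "components": comps}  # (weight, mean, sd)
        i += 1 + k
    return fits

sample = """\
1 173881
1.00000 -13.49319 6.00589
2 283751
0.56915 -17.32806 3.94248
0.43084 -8.08833 3.94228
"""
fits = parse_pe(sample, 1_000_000)  # Willman I excerpt; sweep count assumed
```

This is exactly the kind of reader that could flag a multi-component fit as insignificant whenever one component's weight is tiny.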


## Monday, June 8, 2009

### The Library Quest

While looking at some websites about skewness and kurtosis I found a reference to a certain book that was theoretically in Magill library. It supposedly had an estimate for the error of both kurtosis and skewness and I was curious how they derived it.

So I copied down its call number and set off on my journey to Magill. Up to the 5th Tier I went, only to find that there was no QA section, just Q and QB. Well there was a sign in the stairs that said that most Q's were in the Science Library, so I walked back to the INSC and continued my search. Just as I had thought, there was a whole section on what I was looking for, filled with Bayesian Inference and statistical analysis. But no matter how hard I looked, my books (I found another that I wanted to look at) were nowhere to be found. So I asked the librarian and we both agreed that they were not where they should be.

But we eventually decided that they actually WERE in Magill, despite having nearly identical subject matter to the stuff in the SciLi. Thinking that I was somehow blind the first time I went there, I trudged all the way back up to the fifth floor of Magill. Yet the books were still nowhere to be found. Eventually when I was about to give up, I discovered that the wall did not in fact stretch all the way across the room. Yet there was no way to notice this when you were actually looking for a book in the shelves. The shelves actually continued on the other side of the wall as if it weren't there. And I found out that my books were on the other side of the freaking wall the whole time. Stupid library organization.

Well now I have two statistical books and one even has a whole chapter on Mixture Models (aka what I'm doing). And I just got an email from Jay, telling me how to actually run Nmix. Things are looking up!

### More Kurtosis

Until I figure out what's wrong with Nmix I'll just learn more about kurtosis.

Jen just asked me what the hell kurtosis was, so I should probably explain it a bit better. Technically it's defined as the "standardized fourth population moment about the mean," but that just sounds like mumbo jumbo to me.

Sample kurtosis is defined on this Wikipedia page.

Kurtosis is just another characterization of distributions, like the mean and variance. But while the mean gives the center of the distribution and the variance its width, kurtosis measures how pointy the peak is and how thick the tails are. I'm not really sure how it does that, but it does. A kurtosis of 0 indicates a normal/Gaussian distribution, positive kurtosis indicates a pointy peak and thick tails, and negative means the opposite.

**Kurtosis and Bimodality**

Since bimodal distributions typically have two peaks in close proximity, they essentially form a fat and flat peak to the distribution. Well flat peaks are typical of negative kurtosis, so negative kurtosis can be an indication of bimodality. The lowest possible value of kurtosis is -2, so any distribution with a kurtosis around there is definitely not normal. It might not be bimodal, but that is definitely a possibility, especially when dealing with data that is supposed to be normally distributed like velocities.
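As a quick check of that heuristic, here's a sketch (using NumPy and SciPy, with made-up parameters) comparing the sample kurtosis of a single Gaussian against a well-separated two-Gaussian mixture:

```python
import numpy as np
from scipy.stats import kurtosis  # Fisher definition: 0 for a perfect Gaussian

rng = np.random.default_rng(0)

# A single Gaussian: sample kurtosis should hover near 0.
normal = rng.normal(0.0, 1.0, 100_000)

# Two Gaussians 6 sigma apart: the broad, flat combined peak
# drives the kurtosis well below 0.
bimodal = np.concatenate([
    rng.normal(-3.0, 1.0, 50_000),
    rng.normal(+3.0, 1.0, 50_000),
])

print(kurtosis(normal))   # close to 0
print(kurtosis(bimodal))  # strongly negative (about -1.6 for this separation)
```

(The theoretical floor of -2 corresponds to two infinitely narrow, equally weighted peaks.)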

**Limitations of Kurtosis**

One problem is that different shaped curves can have the same kurtosis. Since it relies on both the pointedness of the peak and the thickness of the tails, changes in one or the other can cancel out the differences in the other factor. Thus a pointy distribution with thin tails can have the same kurtosis as a flatter curve with thick tails.

So if a data set has a lot of outliers, the tails are effectively thicker and the measured kurtosis comes out higher than the underlying distribution's true value.

But limitations like that are typical of all shape characteristics. For example, the mean by itself doesn't tell much about the actual shape of the curve. And the variance doesn't tell much about the skewness or kurtosis. But all together these characteristics give a pretty good estimate of the shape.

Based on: Lawrence T. DeCarlo, "On the Meaning and Use of Kurtosis" (1997). Google it for a PDF.

## Friday, June 5, 2009

### Nmix and Kurtosis

I ran the Nmix program to see if it worked, but after 2 hours it's still going, meaning that I probably did something wrong. But I emailed Jay, who supposedly knows how to use the program and hopefully he'll get back to me sometime.

While that was running in the background I decided to learn a bit more about kurtosis. Something with positive kurtosis is pointy with thick tails, and something with negative kurtosis is flat with thin tails. This is because kurtosis is defined so that it doesn't affect the variance: since more points are concentrated in the middle with positive kurtosis, the variance would decrease unless more points were also added to the tails. This stuff about the tails is important and I didn't really pay attention to it before.

But I just got an email from Jay, which I'll analyze over the weekend and try out next week.

Hmm. The short Nmix doesn't ever end either. Something is clearly wrong. Oh well, that's what I'll do next week when the computers actually have power.

### Reversible jump Markov chain Monte Carlo

This is based on the paper On Bayesian Analysis of Mixtures with an Unknown Number of Components (Richardson and Green, 1997).

Now I have no idea what a reversible jump MCMC actually does, despite my efforts to learn. But I have learned that they can somehow be used along with Bayesian inference to determine the parameters of a data set.

This method, called Nmix by Peter Green, somehow sweeps through various numbers of components (k in the paper), testing the probability of each k. How it does this, I have no idea, but it involves a "birth-death move." From the probabilities, you can then make an educated guess as to the true number of components.

Unfortunately, it relies on a lot of parameters that I don't understand yet and aren't even defined in the paper. These are things like lambda, delta, and nu, and are referred to as "hyperparameters." According to Wikipedia, these are parameters for the prior distribution, which is the knowledge about the data before any observations.

Defaults for these values are listed, and since I don't really know what they individually mean (other than being hyperparameters for other variables), I guess I'll just use the defaults. The authors justify them through statistical analysis that I find somewhat incomprehensible but assume they understand, so I'll trust them.

There are some other justifications of the method, saying how after enough sweeps the k should converge no matter the starting values.

Well that seems about it. After lunch I think I'll start messing around with the program to see what it does.

## Thursday, June 4, 2009

### Summary of the Ashman et al. Paper

The point of the paper is to demonstrate how the KMM (Kaye's Mixture Model; I finally figured out what it stands for!) algorithm can be used to detect bimodality (two groups) in a data set.

The user specifies the number of groups they expect and the predicted mean and standard deviation of each. But if the program works as it should, it will converge to the correct values no matter the initial estimates.

The program first finds the best fit single Gaussian for the data, then calculates a likelihood parameter of that fit.

It then attempts to divide all of the data points into the specified number of groups by calculating the probability that each point of data belongs with each group and putting it into the group with the highest probability. This is then repeated until the number of objects in each group becomes constant. Another likelihood parameter is calculated for this new fit, which is then compared with the initial likelihood parameter to get the likelihood ratio test statistic (LRTS). This is an estimate of the improvement in going from a single to a multiple group fit.
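The iteration described above can be sketched roughly as follows. This is my own hard-assignment approximation with made-up names (`two_group_fit`), not the actual KMM code:

```python
import numpy as np
from scipy.stats import norm

def two_group_fit(data, mu, sigma, max_iter=100):
    """Hypothetical sketch of the KMM-style iteration: assign each point to
    its most probable group, refit each group, repeat until stable."""
    mu = np.asarray(mu, dtype=float).copy()
    sigma = np.asarray(sigma, dtype=float).copy()
    weights = np.array([0.5, 0.5])
    labels = np.full(len(data), -1)  # -1 forces at least one update pass
    for _ in range(max_iter):
        # Probability of each point under each group, scaled by group weight
        p = weights * norm.pdf(data[:, None], mu, sigma)
        new_labels = p.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break  # group memberships stopped changing: converged
        labels = new_labels
        for g in (0, 1):
            pts = data[labels == g]
            if len(pts) > 1:
                mu[g], sigma[g] = pts.mean(), pts.std(ddof=1)
                weights[g] = len(pts) / len(data)
    return labels, mu, sigma, weights

# Two well-separated groups: the fit should recover means near -5 and +5
# even from poor initial guesses.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])
labels, mu, sigma, w = two_group_fit(data, [-1.0, 1.0], [2.0, 2.0])
```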

But that statistic is not nearly as useful as the "P-value," which is the "probability that the LRTS would be at least as large as the observed value if the null hypothesis were true." For most cases, this is the probability that we would get that same LRTS if we were looking at a true single group Gaussian instead of something with multiple groups. So, the smaller the better. A P-value of 0.05 means that there is only a 5% chance of getting that LRTS when looking at a single group.
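In the equal-variance case mentioned later, the LRTS can reportedly be compared to a chi-squared distribution, so the P-value is just an upper-tail probability. A sketch with a made-up LRTS and an assumed number of degrees of freedom:

```python
from scipy.stats import chi2

# Made-up LRTS value and degrees of freedom, purely for illustration;
# the correct dof depends on the two models being compared.
lrts = 9.2
dof = 2

# P-value = probability of an LRTS at least this large under the null
p_value = chi2.sf(lrts, dof)
print(p_value)  # about 0.01: a true single Gaussian rarely yields an LRTS this big
```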

**Double Root Residuals (DRR)**

This is a fancy method of comparing the data with a specified model distribution. It essentially compares the square root of the number of data points in each bin to the square root of the expected value from the model and then plots the result. It's a little more complicated than that, but that's the basic idea. If the result is greater than 2 at a certain bin, then there is a discrepancy between the model and the data at the 2 sigma (95%) level. This can be used to compare the data to a unimodal model to see if they strongly disagree.
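A rough sketch of the idea (the residual formula below is one common variant from the literature, not necessarily the exact one the paper uses):

```python
import numpy as np
from scipy.stats import norm

def double_root_residuals(counts, expected):
    # One common form of the double root residual: roughly
    # 2 * (sqrt(observed) - sqrt(expected)) per bin, stabilized for small counts
    counts = np.asarray(counts, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return np.sqrt(2.0 + 4.0 * counts) - np.sqrt(1.0 + 4.0 * expected)

# Toy comparison: histogram of a Gaussian sample vs. the bin counts
# a unimodal Gaussian model predicts.
rng = np.random.default_rng(2)
sample = rng.normal(0.0, 1.0, 1000)
edges = np.linspace(-4.0, 4.0, 17)
counts, _ = np.histogram(sample, edges)
model = 1000 * np.diff(norm.cdf(edges))  # expected counts in each bin
drr = double_root_residuals(counts, model)
# |DRR| > 2 in a bin would flag a ~2-sigma disagreement with the model
```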

**Shape Estimators**

They also use both skewness and kurtosis to measure the Gaussian-ness of the data. Skewness measures the symmetry of the data (skew of 0 is perfectly symmetric, negative skew means that most of the mass is concentrated to the right). Kurtosis measures the "peakedness" of the data. A kurtosis of 0 indicates a pure Gaussian, a positive kurtosis means that the data is pointier than a Gaussian, and negative means that the peak is flatter. Thus data sets with multiple groups typically have negative kurtosis because the central peak is spread out.

Paper about Skewness and Kurtosis

While some of these tests may give false indication of bimodality or unimodality, when they are all combined they are very effective in determining the number of groups in a data set.

**Summary**

The P-value obtained by this method can only be compared to a chi-squared distribution when both groups have the same variance. If they don't, bootstrapping must be done, which involves taking a random sample of the data with replacement and then calculating its statistics. I'm not entirely sure how this works, but I don't think I need to do it.
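The resample-with-replacement step itself is simple; here's a sketch of bootstrapping the error on a sample mean, with made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(10.0, 2.0, 60)  # made-up sample, ~60 "velocities"

# Resample with replacement, recompute the statistic each time, and use
# the scatter of the results as the error estimate.
n_boot = 2000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_boot)
])

print(boot_means.std())                       # bootstrap error on the mean
print(data.std(ddof=1) / np.sqrt(len(data)))  # analytic standard error, for comparison
```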

There is also the problem of determining just how many groups are in the data set. The P-value is a useful guideline to the significance of the value, but for a more accurate value you need to bootstrap again. At the time of the paper's writing (1994) there was no single powerful technique to determine the number of groups in a data set, but it is generally accepted that the model that works well with the fewest groups is the best model. This follows from Occam's razor.

## Wednesday, June 3, 2009

### Bimodality and Such

Continuing from yesterday, I ran more simulations to see whether Gaussian data sets could ever be completely sigma-clipped away. I did this for sigma-clipping at 2.75, 2.5, and 1.0 standard deviations, before eventually deciding that if we include the measurement error in the sigma-clip parameters there is absolutely no way to remove all of the data. The measurement error on some of the points was way larger than the standard deviation of the entire sample, so those data points were almost never removed by the sigma-clipping.

The rest of the day (at least until now) was spent learning about the KMM algorithm, which can be used to detect bimodality in data samples. Bimodality means that there are two peaks in the sample, which for astronomical data typically means that there are two stellar populations in the sample. This can be used to detect multiple galaxies or star clusters in a sample, or to check whether all the stars come from a single population!

The cool thing about the algorithm is that it can detect bimodalities in samples that appear somewhat single-Gaussian to the naked eye when put into a histogram. But it's not perfect and works best when both samples have the same standard deviation, something which I believe to be rather unlikely in the real world. But it still gives a decent indication that something other than a single population is in the sample.

## Tuesday, June 2, 2009

### First Two Days of Research

After relearning IDL and the Linux command line interface, I was ready to begin.

This week I'll be dealing with the fact that gravitationally bound objects (like dwarf galaxies) have normally distributed velocity distributions and tidally disrupted objects most likely do not.

So if we look at the velocity distributions of the stars in what we believe to be a dwarf galaxy, like Segue I or Bootes II, we expect a Gaussian distribution. There is a process called sigma clipping, wherein one eliminates all data points more than some threshold (typically 3 standard deviations) away from the mean. This process is then repeated until no more data is removed.

According to conventional wisdom, a Gaussian distribution should maintain its form and not be sigma-clipped into oblivion. The same logic says that other distributions can continue to lose data after each sigma-clip until there is no data left.
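A minimal sketch of that iterative procedure (my own implementation, not the code I actually ran):

```python
import numpy as np

def sigma_clip(values, nsigma=3.0):
    """Repeatedly drop points more than nsigma standard deviations
    from the mean, until a pass removes nothing."""
    values = np.asarray(values, dtype=float)
    while len(values) > 1:
        mean, std = values.mean(), values.std(ddof=1)
        keep = np.abs(values - mean) <= nsigma * std
        if keep.all():
            break  # nothing removed this pass: done
        values = values[keep]
    return values

# A clean Gaussian sample should survive 3-sigma clipping nearly intact,
# while a wild outlier gets removed.
rng = np.random.default_rng(4)
velocities = rng.normal(200.0, 5.0, 1000)
clipped = sigma_clip(velocities)
```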

My goal was to test a randomly generated Gaussian sample that was based on the mean velocity and uncertainties of Segue I's velocity distribution. I initially discovered that without incorporating the individual measurement error, there was a very high probability (greater than 10%) that a Gaussian distribution would be completely obliterated by sigma-clipping! This would imply that a dwarf galaxy could appear to be tidally disrupted even while truly having a normal (Gaussian) velocity distribution.

However, once I added in the individual measurement errors I found that sigma-clipping (at least at 3 sigma) removed almost no stars from the distribution.

Tomorrow I'll repeat the process for sigma-clipping at 2.5 or 2.75 standard deviations.
