Answering 4 RG Questions related to PCA

in #steemstem, 6 years ago

Last week I was checking Alexa and realized that there are 1,070 backlinks to Steemit. Not bad, but still few enough that we can make a difference.


There was something else, also encouraging. Although we all brag about the content on Steemit, the majority of traffic is actually organic. The bounce rate is fine as well.


So let's bring more traffic by targeting specific niches.

It could be useful in general and particularly for our STEM community.


Questions:


  • Should you implement correlation before or after factor analysis? (link)
  • What is your suggested solution, when the correlation matrix is not positive definite? (link)
  • How can we interpret negative factor loading? (link)
  • Factor Analysis: Which method and rotation should I use? (link)

    According to their RG score, all the participants are young, and it would be nice to help them to understand the technique and to develop their intuition.

    If you ask Scholar.Google what it thinks about PCA, you will find almost a million results:

    The history of PCA dates back to 1901 and the famous Karl Pearson.
    By the beginning of WWII, the math behind PCA had become fully developed.
    After the war, it was fully embraced by virtually all the disciplines.
    When Jordan Peterson is speaking about The Big Five, he is referring to a model that came out of factor analysis, a close relative of PCA.
    When you read something about the ingredients of food, you will see a PCA plot.
    Even when you are reading about molecular biology, PCA will appear!
    Figure 1
    Great study about the origin of African-Americans. Open Access, Creative Commons 2.0, link

    However, it's actually not all that good for many applications.

    PCA can be your starting point, but you should always try to refine the analysis by applying more advanced techniques.

    Sometimes, there will be no differences. In that case, PCA is just fine. Otherwise, try at least factor analysis, ICA, NMF or SOBI-RO, or, if you need the best, try TDA (combine it with PCA and take the results to new heights).

    Should you implement correlation before or after factor analysis?


    After you pass your 20th dataset, you will develop some sort of intuition for what works well and what does not work at all.


    It's useful to construct a correlation matrix before you begin.

    It could happen that one of the variables or samples is a complete "outgroup". That "column" or "row" will affect the result by "compressing" the actually relevant data into a cluster that cannot be resolved.

    If that is the case, fine: do PCA with all the data, remove the problematic entries, and run the analysis again!

    Check the link that I gave you and see the difference.
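The outgroup effect can be reproduced on synthetic data. The following is only a minimal sketch with NumPy: the data, the helper names `pca_scores` and `group_gap`, and the idea of flagging the sample with the lowest mean correlation to the others are all assumptions made for illustration, not a procedure taken from the linked discussion.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA via SVD of the column-centered data; returns the sample scores."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def group_gap(scores, a_idx, b_idx):
    """Distance between two group means on PC-1, as a fraction of the PC-1 range."""
    pc1 = scores[:, 0]
    return abs(pc1[a_idx].mean() - pc1[b_idx].mean()) / (pc1.max() - pc1.min())

# Synthetic data (assumption): two similar groups of 10 samples plus one outgroup sample.
rng = np.random.default_rng(0)
profile = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
shift = np.array([0.5, 0.0, 0.0, 0.0, 0.0, -0.5])
group_a = profile + rng.normal(0, 0.1, size=(10, 6))
group_b = profile + shift + rng.normal(0, 0.1, size=(10, 6))
outgroup = 3 * profile[::-1] + rng.normal(0, 0.1, size=(1, 6))
X = np.vstack([group_a, group_b, outgroup])

# Correlation matrix between samples (rows); the outgroup correlates poorly with everyone.
sample_corr = np.corrcoef(X)
mean_corr = (sample_corr.sum(axis=1) - 1.0) / (len(X) - 1)
suspect = int(np.argmin(mean_corr))

# Compare how well the two groups separate on PC-1 with and without the suspect sample.
a_idx, b_idx = slice(0, 10), slice(10, 20)
gap_all = group_gap(pca_scores(X), a_idx, b_idx)
gap_clean = group_gap(pca_scores(np.delete(X, suspect, axis=0)), a_idx, b_idx)
print("suspect sample:", suspect)
print("group separation with / without it:", round(gap_all, 3), round(gap_clean, 3))
```

With the outgroup included, PC-1 is spent on the single odd sample and the two real groups collapse together; after removing it, the same PC-1 separates them.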


    Another scenario could be that you will find several groups of data, strongly correlated within the group and very distinctive from other groups. 

    If you use the "whole dataset", you will find those groups!

    If you want to look deeper, analyse those groups separately.
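One way to spot such groups before running PCA is to threshold the correlation matrix. This is a rough sketch under stated assumptions: the data are synthetic, `correlated_groups` is a hypothetical helper, and the 0.8 threshold is an arbitrary choice that would need tuning on real data.

```python
import numpy as np

def correlated_groups(corr, threshold=0.8):
    """Group variables that are linked by |correlation| >= threshold."""
    unassigned = list(range(corr.shape[0]))
    groups = []
    while unassigned:
        stack = [unassigned.pop(0)]
        group = []
        while stack:
            i = stack.pop()
            group.append(i)
            linked = [j for j in unassigned if abs(corr[i, j]) >= threshold]
            for j in linked:
                unassigned.remove(j)
            stack.extend(linked)
        groups.append(sorted(group))
    return groups

# Synthetic data (assumption): variables 0-2 follow one latent signal, 3-4 another.
rng = np.random.default_rng(2)
n = 200
f1, f2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([
    f1 + 0.1 * rng.normal(size=n),
    f1 + 0.1 * rng.normal(size=n),
    f1 + 0.1 * rng.normal(size=n),
    f2 + 0.1 * rng.normal(size=n),
    f2 + 0.1 * rng.normal(size=n),
])

corr = np.corrcoef(X, rowvar=False)  # correlation between variables (columns)
groups = correlated_groups(corr)
print(groups)
```

Each group can then be passed to PCA on its own, for example with `X[:, groups[0]]`.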


    What is your suggested solution, when the correlation matrix is not positive definite?


    By the definition that I've found here:

    > A matrix which fails this test is "not positive definite." If the determinant of the matrix is exactly zero, then the matrix is "singular."


    In practice:

      The solution is simple: delete the variables (columns) that cause the problem :D
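A quick way to see this in practice is to look at the eigenvalues of the correlation matrix. The sketch below uses NumPy on made-up data where one column is an exact multiple of another; the helper name `is_positive_definite` and the tolerances are assumptions for illustration.

```python
import numpy as np

def is_positive_definite(corr, tol=1e-10):
    """True if every eigenvalue of the symmetric matrix is above tol."""
    return bool(np.linalg.eigvalsh(corr).min() > tol)

# Synthetic data (assumption): column 3 is an exact multiple of column 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X = np.column_stack([X, 10 * X[:, 0]])

corr = np.corrcoef(X, rowvar=False)
print("positive definite:", is_positive_definite(corr))

# Find pairs of columns that are (almost) perfectly correlated and drop the second one.
n_vars = corr.shape[0]
pairs = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)
         if abs(corr[i, j]) > 1 - 1e-8]
X_clean = np.delete(X, sorted({j for _, j in pairs}), axis=1)

corr_clean = np.corrcoef(X_clean, rowvar=False)
print("redundant pairs:", pairs)
print("positive definite after deleting:", is_positive_definite(corr_clean))
```

Pairwise checks only catch the simplest case. A column that is a combination of several others also makes the matrix singular, and the near-zero eigenvalue will still reveal it.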


      How can we interpret negative factor loading?


      A factor loading corresponds to a correlation coefficient between the variable and the factor. If it's negative, you have a negative correlation. In the example I gave you, compare European vs African along PC-1 and then along PC-2. What do you see?
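The link between loadings and correlations can be checked numerically. This is a small sketch for the PCA case on standardized synthetic data (an assumption; factor analysis with oblique rotation behaves differently): the loading of each variable on PC-1 is compared with its plain correlation to the PC-1 scores.

```python
import numpy as np

# Synthetic data (assumption): variables 0 and 2 follow a signal f, variable 1 follows -f.
rng = np.random.default_rng(3)
f = rng.normal(size=300)
X = np.column_stack([
    f + 0.3 * rng.normal(size=300),
    -f + 0.3 * rng.normal(size=300),
    f + 0.3 * rng.normal(size=300),
])

# PCA on the correlation matrix of the standardized variables.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs * np.sqrt(eigvals)        # loading = eigenvector * sqrt(eigenvalue)
scores = Z @ eigvecs

# Correlation of every variable with the PC-1 scores.
pc1_corr = np.array([np.corrcoef(Z[:, j], scores[:, 0])[0, 1] for j in range(3)])
print("PC-1 loadings:    ", np.round(loadings[:, 0], 3))
print("corr with PC-1:   ", np.round(pc1_corr, 3))
```

Variable 1 gets the opposite sign from variables 0 and 2. The overall sign of a component is arbitrary, so only the relative signs carry meaning.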


      There are cases when you strictly need to extract only positive values. 

      For example, fluorescence is additive: every component contributes in a positive manner. In that case, you can't use PCA. You need something like NMF.
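To make the difference concrete, here is a minimal sketch of NMF using the classic multiplicative update rules, written with NumPy only. The two Gaussian "emission spectra", the random concentrations and the function name `nmf` are illustrative assumptions; for real work a tested library implementation would be the safer choice.

```python
import numpy as np

def nmf(X, rank, n_iter=2000, seed=0, eps=1e-12):
    """Factor X ~ W @ H with W, H >= 0 using multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], rank)) + 0.1
    H = rng.random((rank, X.shape[1])) + 0.1
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic additive mixture (assumption): two non-negative spectra, 30 samples.
wavelength = np.linspace(400, 600, 101)
spectra = np.vstack([
    np.exp(-0.5 * ((wavelength - 470) / 20) ** 2),
    np.exp(-0.5 * ((wavelength - 530) / 20) ** 2),
])
rng = np.random.default_rng(4)
conc = rng.random((30, 2))
X = conc @ spectra

W, H = nmf(X, rank=2)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)

# PCA components of the same data, for comparison.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_components = Vt[:2]

print("NMF: smallest value in W, H:", W.min(), H.min())
print("NMF: relative reconstruction error:", rel_error)
print("PCA: smallest value in the first two components:", pca_components.min())
```

The PCA components contain negative values, which have no physical meaning as emission spectra, while the NMF factors stay non-negative by construction.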


      Factor Analysis: Which method and rotation should I use?


      This one is difficult, both from a "philosophical" and from a practical perspective.


      If you know what you should expect, in spectroscopy for example - it's easy.

      Accept only those values with physical meaning.

      If you don't know what to expect... God have mercy on your soul... The problem is:

      result = component x coefficient + error

      If your components are wrong, your coefficients are wrong as well! You are doomed :D
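A short numerical illustration of that last point, as a sketch with NumPy least squares. The Gaussian components, the true coefficients and the "wrong" second component (a peak placed at 7.5 instead of 6) are all made up for the example.

```python
import numpy as np

def gauss(x, center, width=1.0):
    """Gaussian peak used as a stand-in for a spectral component."""
    return np.exp(-0.5 * ((x - center) / width) ** 2)

x = np.linspace(0, 10, 200)
true_components = np.vstack([gauss(x, 4.0), gauss(x, 6.0)])
wrong_components = np.vstack([gauss(x, 4.0), gauss(x, 7.5)])  # second peak misplaced
true_coef = np.array([2.0, 1.0])

# result = component x coefficient + error
rng = np.random.default_rng(5)
result = true_coef @ true_components + rng.normal(0, 0.01, size=x.size)

# Least-squares coefficients for the correct and for the wrong components.
coef_good, *_ = np.linalg.lstsq(true_components.T, result, rcond=None)
coef_bad, *_ = np.linalg.lstsq(wrong_components.T, result, rcond=None)

error_good = np.abs(coef_good - true_coef).max()
error_bad = np.abs(coef_bad - true_coef).max()
print("correct components:", np.round(coef_good, 3))
print("wrong components:  ", np.round(coef_bad, 3))
```

Note that the coefficient of the first component is also distorted, although that component itself was correct: the fit compensates for the misplaced peak.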


      There is a rule of thumb. First, check the data:

      • do you have any column full of zeros or NaN?
      • do you have missing values?
      • do you have a whole column with the same value (usually zeros, or values that are too high because of saturation)?
      • do you have columns with exactly the same linear change (example: 1 2 5 7 and 10 20 50 70)?

      Then compare three solutions:

      • try without rotation
      • try with orthogonal rotation
      • try with oblique rotation

      For the majority of datasets, all three solutions will be very similar. Use the first one and avoid the question "why were you complicating things for no reason?"

      If the components are independent(ish), the second solution will be different from the first, and the third one will be almost the same as the second. Use the second one.

      If the components are very similar, the third option will be very different from the second one. This is usual in spectroscopy, especially fluorescence spectroscopy. Use the third one.
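The first two steps of this comparison can be sketched with a plain NumPy implementation of varimax, the most common orthogonal rotation. The loading matrix below is invented for the example (a clean two-factor structure turned by 30 degrees), and oblique rotations such as promax or oblimin are not implemented here.

```python
import numpy as np

def varimax(loadings, max_iter=500, tol=1e-8):
    """Orthogonal varimax rotation; returns rotated loadings and the rotation matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        grad = L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(loadings.T @ grad)
        R = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol):   # stop when the criterion no longer improves
            break
        d = d_new
    return loadings @ R, R

# Hypothetical "unrotated" solution: a simple structure turned by 30 degrees.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
                   [0.0, 0.9], [0.0, 0.8], [0.0, 0.7]])
theta = np.deg2rad(30)
turn = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ turn

rotated, R = varimax(unrotated)

# How different are the two solutions? Look at the cross-loadings of each variable.
cross_before = np.abs(unrotated).min(axis=1).max()
cross_after = np.abs(rotated).min(axis=1).max()
print("largest cross-loading without rotation:", round(cross_before, 3))
print("largest cross-loading after varimax:   ", round(cross_after, 3))
```

Here the rotated solution is clearly different from the unrotated one, so by the rule above the orthogonal rotation is worth keeping. When the two numbers are nearly equal, the unrotated solution is enough.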

