Hi @statman,
I completely agree with you: whatever the data, it should be used with clear objectives, to answer questions with sufficient information and precision for the context. There are indeed many questions to answer on this specific topic before we can help the OP.
In short, "supervised" and "unsupervised" learning are the two common Machine Learning terms I was referring to.
- In "supervised learning", we provide the machine learning model with labeled observations (containing one or several response values) to learn from, so it can predict the labels/responses of new, unseen examples.
- "Unsupervised learning" involves training the model on unlabeled data, where the model tries to find patterns, groups or structures on its own, without any predefined labels/responses (clustering is a typical use case). This is what I mean by the absence of "ground truth": there are no labels/response values against which to verify and improve the model.
I hope this makes my previous answer easier to understand.
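To make the distinction concrete, here is a minimal sketch in plain Python. The data values, the function names (`fit_centroids`, `predict`, `kmeans_1d`) and the choice of a nearest-centroid classifier and a basic k-means are my own assumptions for illustration, not a recommendation for the OP's data.

```python
from statistics import mean

# Toy 1-D data: two well-separated groups (values invented for illustration).
X = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
y = ["low", "low", "low", "high", "high", "high"]  # labels = the "ground truth"

# Supervised: learn one centroid per label, then predict the label of a new point.
def fit_centroids(X, y):
    return {label: mean(x for x, l in zip(X, y) if l == label) for label in set(y)}

def predict(centroids, x):
    # Return the label whose centroid is closest to x.
    return min(centroids, key=lambda label: abs(x - centroids[label]))

centroids = fit_centroids(X, y)
prediction = predict(centroids, 4.5)

# Unsupervised: basic k-means with k=2 on X only; the labels y are never used.
def kmeans_1d(X, centers, n_iter=10):
    def nearest(x, centers):
        return min(range(len(centers)), key=lambda i: abs(x - centers[i]))
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for x in X:
            groups[nearest(x, centers)].append(x)
        # Keep the old center if a group ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, [nearest(x, centers) for x in X]

centers, clusters = kmeans_1d(X, centers=[min(X), max(X)])
print(prediction, clusters)
```

In the supervised part, the prediction can be checked against known labels. In the unsupervised part, the cluster numbers are arbitrary identifiers, and nothing in the data says whether the grouping is "right".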
I'm not implying any "global ground truth" can be revealed on anything by any model (even if we already know that 42 is the Answer to the Ultimate Question of Life, the Universe, and Everything).
I have known this quote for a long time, and I also use it a lot to explain statistical modeling and concepts.
@dale_lehman: Agreed. If any response is present in the dataset, it should not be included in the clustering (to prevent data/information leakage). Your description looks related to the clustering of variables I was referring to: use the clusters as inputs in the predictive model.
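As a rough sketch of that idea, assuming hypothetical columns `x1`, `x2` (predictors) and `y` (response), invented values, and a deliberately trivial "model" (mean response per cluster): the clustering only sees the predictors, and the resulting cluster membership is then used as an input.

```python
from statistics import mean

# Hypothetical rows: two predictors and a response (values invented for illustration).
rows = [
    {"x1": 1.0, "x2": 2.0, "y": 10.0},
    {"x1": 1.1, "x2": 1.9, "y": 11.0},
    {"x1": 5.0, "x2": 6.0, "y": 30.0},
    {"x1": 5.2, "x2": 6.1, "y": 32.0},
]
predictors = ["x1", "x2"]  # the response "y" is deliberately left out of the clustering

def nearest(point, centers):
    # Index of the closest center by squared Euclidean distance.
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def kmeans(points, centers, n_iter=10):
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # New center = column-wise mean of the group (old center kept if group is empty).
        centers = [tuple(mean(col) for col in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return centers

points = [tuple(r[k] for k in predictors) for r in rows]
centers = kmeans(points, centers=[points[0], points[-1]])

# Add cluster membership as a new input column.
for r, p in zip(rows, points):
    r["cluster"] = nearest(p, centers)

# Trivial predictive model using the cluster as its only input: mean response per cluster.
model = {c: mean(r["y"] for r in rows if r["cluster"] == c)
         for c in {r["cluster"] for r in rows}}

new_point = (4.8, 5.9)  # a new observation, predictors only
prediction = model[nearest(new_point, centers)]
print(prediction)
```

In a real analysis, the clustering would also be fitted on the training set only and then applied to the validation/test sets, for the same leakage reason.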
Victor GUILLER
L'Oréal Data & Analytics
"It is not unusual for a well-designed experiment to analyze itself" (Box, Hunter and Hunter)