This is a rather large topic, and I don't know your current level of understanding of experimental design. Have you researched conjoint analysis and choice designs? There are many case studies of DOE being applied successfully in marketing!
My thoughts:
1. It looks like a one-factor experiment where the factor is the length of the creative. The diagram you attached appears to be a nested study. Typically, DOE is used to investigate a number of factors (usually somewhere between 2 and 15) and their possible interactions. The intent is to create variability in the response by manipulating factors at bold level settings, i.e., levels set far enough apart that an effect, if there is one, will show up. Setting the variables at bold levels exaggerates the factor effects, making it easier to determine the relationships between the factors and the response(s).
2. Others may argue, but in experimentation there is no need for a separate "hold out" group. As long as the current levels are included in the study, the comparison to current practice is already embedded in the design.
3. You are proposing a huge sample size. This is also not typical of experimentation. In experimentation you increase the inference space by manipulating factors and by using blocking (or similar techniques) to handle noise. I would suggest you start on a much smaller scale to better understand what affects the decision to join any loyalty program. As part of a sequential strategy, first screen a large set of potential factors, then iterate on the factors that matter to determine optimum settings.
4. Suggestions include developing more specific hypotheses as to what would and would not impact persons joining the loyalty program. It seems there are lots of possibilities (e.g., amount of $ coupon, other financial benefits, other perks/promotions/incentives, design of the email (font, graphics, colors, animation, sound), etc.).
5. There are also additional responses you may be interested in. Do you care how long they stay in the loyalty program? Are you concerned with a possible negative reaction from those who are already in the loyalty program ("Where's my coupon?")? The proposed metric appears to be categorical (nominal). I would create an ordinal scale for the response instead.
6. Perhaps you should start with:
- A look at historical data for loyalty program membership (regression could be quite useful here) to generate more hypotheses
- A sampling plan. If you are going to have such a huge data set, why not sample from it to get some clues about what drives people to join a loyalty program? This may also help determine whether your measurement systems are adequate and provide clues as to how to handle non-response.
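To make the idea of manipulating several factors at once a bit more concrete, here is a minimal Python sketch of a two-level full factorial design. The factor names and level settings (coupon_amount, email_length, graphics) are purely hypothetical placeholders I made up for illustration, not a recommendation for your study, and the effect() helper is just the textbook "average at high minus average at low" contrast. A real screening design with many factors would more likely be a fractional factorial built in DOE software.

```python
from itertools import product

# Hypothetical factors with "bold" low/high settings (assumed for illustration only).
FACTORS = {
    "coupon_amount": (5, 25),            # dollars
    "email_length": ("short", "long"),
    "graphics": ("plain", "rich"),
}

def full_factorial(factors):
    """Return every low/high combination as coded runs (-1 = low, +1 = high)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product((-1, 1), repeat=len(names))]

def effect(design, responses, *names):
    """Estimate a main effect (one name) or an interaction (several names).

    The contrast column is the product of the coded columns named; the effect is
    the mean response where that product is +1 minus the mean where it is -1.
    """
    signs = []
    for run in design:
        sign = 1
        for name in names:
            sign *= run[name]
        signs.append(sign)
    high = [y for s, y in zip(signs, responses) if s == 1]
    low = [y for s, y in zip(signs, responses) if s == -1]
    return sum(high) / len(high) - sum(low) / len(low)

design = full_factorial(FACTORS)
for run in design:
    # Translate coded levels back into the actual settings for each treatment.
    settings = {name: FACTORS[name][(code + 1) // 2] for name, code in run.items()}
    print(settings)
```

Once you have a response for each of the 8 treatments (in the same order as design), something like effect(design, y, "coupon_amount") gives a main effect and effect(design, y, "coupon_amount", "graphics") gives a two-factor interaction. Note that the current settings can simply be one of the levels, which is why a separate hold-out group is not needed.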
"All models are wrong, some are useful" G.E.P. Box