There is not enough context to give specific advice, so here are my initial thoughts and questions:
1. I assume F and P are fail and pass. Fail and pass what? Is it a specification? Can you make it a continuous measurement, or develop a measurement that quantifies the "phenomenon" better? I ask because attribute/discrete categorical response variables are inefficient: they require more experimental units to detect changes and assign cause, and they have the flaw of being an aggregate of potentially many failure modes/mechanisms. (The rough sample-size comparison after this list illustrates the efficiency gap.) Have you studied the measurement system that is categorizing F/P? Is the 15% consistent?
2. I also want to point out a distinction: are you trying to explain why failures occur, or are you trying to predict failures? Sampling may be a more effective way to explain failures.
3. Are you trying to understand causal structure or "pick a winner"? Do you plan on iterating?
4. The question about level setting suggests you should consider the assumptions. In DOE it is "assumed" that if you set a factor at level 1, change it to another level, and then set it back to level 1 again, it is set to exactly the same level as the first time. In reality that is impossible, since variation exists in everything, so one way to reduce the impact of this handicap is to ensure the levels are set boldly different (bold but reasonable). And of course, the boldness across the factors should be about the same (unbiased). The small simulation after this list shows how wider level spacing makes an effect stand out against setting error.
5. Not sure you need to block on all 3 lines. Why are the lines different (hypotheses)? Can you pick the 2 lines that are most different as a starting point? Do you care whether the factor effects on one line differ from another line (RCBD), or do you just want to quantify the effect of line and increase the precision of the design (BIB)? A possible blocked layout is sketched after this list.
6. Are there other "noise" variables that need to be accounted for? For example, raw material lots or operator technique.
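To put rough numbers behind point 1, here is a minimal sketch (using statsmodels) of how many units it can take to detect a change with a pass/fail response versus a continuous one. The specific targets are assumptions I made up for illustration: a drop in fail rate from 15% to 7.5%, versus a "medium" standardized shift (Cohen's d = 0.5) on a hypothetical continuous version of the response, both at alpha = 0.05 and 80% power.

```python
# Rough sample-size comparison: pass/fail response vs. a continuous measurement.
# The 7.5% target and d = 0.5 are illustrative assumptions, not from the question.
from statsmodels.stats.power import NormalIndPower, TTestIndPower
from statsmodels.stats.proportion import proportion_effectsize

alpha, power = 0.05, 0.80

# Binary response: effect size (Cohen's h) for a 15% -> 7.5% fail rate
h = proportion_effectsize(0.15, 0.075)
n_binary = NormalIndPower().solve_power(effect_size=h, alpha=alpha,
                                        power=power, alternative='two-sided')

# Continuous response: assume the same physical change shows up as d = 0.5
n_continuous = TTestIndPower().solve_power(effect_size=0.5, alpha=alpha,
                                           power=power, alternative='two-sided')

print(f"per-group n, pass/fail response:   {n_binary:.0f}")
print(f"per-group n, continuous response:  {n_continuous:.0f}")
```

The exact ratio depends entirely on these made-up numbers; the point is simply that a proportion carries much less information per run than a good continuous measurement, and the penalty gets worse as the fail rate you are chasing gets smaller.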
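For point 4, a toy simulation of the level-setting handicap: a two-level factor whose actual setting wobbles around the nominal level, analyzed against the nominal levels. All numbers (true slope, setting error, noise, spacings) are invented for illustration.

```python
# Toy simulation: how "bold" level spacing affects how clearly an effect
# stands out from setting error and run-to-run noise. All values are made up.
import numpy as np

rng = np.random.default_rng(1)
true_slope  = 2.0      # assumed true effect per unit of the factor
set_sd      = 0.5      # wobble when trying to (re)set a level (the handicap)
noise_sd    = 3.0      # run-to-run noise in the response
n_per_level = 8

def slope_sd(spacing, n_sim=2000):
    """Empirical SD of the estimated effect when levels are set at +/- spacing/2."""
    estimates = []
    for _ in range(n_sim):
        nominal = np.repeat([-spacing / 2, spacing / 2], n_per_level)
        actual  = nominal + rng.normal(0, set_sd, nominal.size)   # imperfect setting
        y       = true_slope * actual + rng.normal(0, noise_sd, nominal.size)
        estimates.append(np.polyfit(nominal, y, 1)[0])            # slope vs nominal levels
    return np.std(estimates)

for spacing in (1, 2, 4):
    print(f"level spread {spacing}: SD of estimated effect = {slope_sd(spacing):.2f}")
```

The uncertainty in the estimated effect shrinks roughly in proportion to 1/spacing while the setting error stays the same size, which is the argument for bold but reasonable levels.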
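And for point 5, a sketch of what an RCBD-style layout could look like if you start with the 2 most different lines: each line is a block, a full 2^3 factorial in three hypothetical factors A, B, C is run within each block, and the run order is randomized separately per line. The factor names and the choice of a full 2^3 are placeholders, not a recommendation.

```python
# RCBD-style layout sketch: each line is a block containing a full 2^3 factorial,
# with run order randomized within each block. Factors A, B, C are placeholders.
import itertools
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
factors = ["A", "B", "C"]
design  = pd.DataFrame(list(itertools.product([-1, 1], repeat=3)), columns=factors)

blocks = []
for line in ["Line1", "Line2"]:                    # start with the 2 most different lines
    block = design.sample(frac=1, random_state=int(rng.integers(1_000_000)))
    block = block.reset_index(drop=True)
    block.insert(0, "line", line)
    block.insert(1, "run_order", list(range(1, len(block) + 1)))
    blocks.append(block)

layout = pd.concat(blocks, ignore_index=True)
print(layout)
```

The analysis would then include line as a block term (for example, y ~ A*B*C + C(line) in a statsmodels formula) so the line-to-line difference is removed from the error term rather than inflating it.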
"All models are wrong, some are useful" G.E.P. Box