It seems clear to me that there is an optimal starting word, but that the best second word has to depend on the info you gain from the first.
Case in point: If you get 3 green 2 yellow in the first word, you can solve on the next guess.
Of course, you can constrain your strategies to "I always use the same two/three starting words", and in many cases that will be fine. But it's quite obviously not optimal.
Also, the optimal strategy must depend on your goal metric. Do you go for "least average guesses", "least maximum guesses", or "least average guesses while never losing"? There's lots of unstated assumptions in all the analyses thrown around...
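To make the metric question concrete, here is a minimal sketch of how the three goals can rank two strategies differently. The guess counts below are made-up inputs for illustration, not results from any real solver, and `summarize` is a hypothetical helper name.

```python
def summarize(guess_counts, max_guesses=6):
    """Summarize a strategy from the guesses it needed for each answer.

    guess_counts is a hypothetical list: one entry per target word.
    """
    return {
        "average": sum(guess_counts) / len(guess_counts),
        "worst": max(guess_counts),
        "losses": sum(1 for n in guess_counts if n > max_guesses),
    }

# Made-up numbers, purely illustrative.
strategy_a = [3, 3, 3, 7]  # better average, but loses one game
strategy_b = [4, 4, 4, 5]  # worse average, never loses

print(summarize(strategy_a))
print(summarize(strategy_b))
```

Strategy A wins on "least average guesses", B wins on "least maximum guesses" and on "never losing", so the two are not comparable until you pick the metric.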
> It seems clear to me that there is an optimal starting word
I took one step further and calculated the optimal starting word for obtaining green matches (I assume that this also makes it likely to produce yellow matches, although I did not explicitly optimize for that).
Beginning with the full list of 5-letter words, I calculated the frequency of each letter of the alphabet in each of the 5 possible positions for a 5 letter word.
Then I iterated through the list a second time, this time assigning a score for each word equal to the sum of frequencies for each letter in its respective position.
By a significant margin, the highest score is SLATE (over 1400). Runners up (over 1300) are SAUTE, SHIRE, and CRATE.
Caveat: this approach assumes that all possible words are equally likely to be the answer.
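For anyone who wants to try this, the two passes described above might look roughly like the sketch below. The function names are my own and the word list is a toy stand-in, so the scores will not match the 1400/1300 figures quoted above.

```python
from collections import Counter

def position_counts(words):
    """First pass: count each letter in each of the 5 positions."""
    counts = [Counter() for _ in range(5)]
    for word in words:
        for i, ch in enumerate(word):
            counts[i][ch] += 1
    return counts

def positional_score(word, counts):
    """Second pass: sum the counts of the word's letters in their positions."""
    return sum(counts[i][ch] for i, ch in enumerate(word))

def rank_words(words):
    """Return the words sorted from highest to lowest positional score."""
    counts = position_counts(words)
    return sorted(words, key=lambda w: positional_score(w, counts), reverse=True)

# Toy list for illustration only; a real run would load the full word list.
words = ["slate", "crate", "shire", "saute", "adieu"]
print(rank_words(words))
```

Note that every word in the list counts equally here, which is exactly the "all words equally likely" assumption from the caveat.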
Interestingly, doing a similar analysis where we bucketed each word into 243 (3^5) buckets based on the possible result, we found "RAISE" as the best word. Source[0]
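A rough sketch of the bucketing idea, for reference: compute the green/yellow/grey pattern a guess would produce against every candidate answer, then group candidates by pattern. The scoring rule here (average bucket size, assuming every candidate is equally likely) is my assumption; the linked analysis may have used a different one.

```python
from collections import Counter

def feedback(guess, answer):
    """Return a 5-character pattern: g = green, y = yellow, - = grey."""
    result = ["-"] * 5
    unmatched = Counter()
    # Mark greens first and remember which answer letters are still unmatched.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            unmatched[a] += 1
    # A non-green letter is yellow only while unmatched copies remain.
    for i, g in enumerate(guess):
        if result[i] == "-" and unmatched[g] > 0:
            result[i] = "y"
            unmatched[g] -= 1
    return "".join(result)

def bucket_sizes(guess, candidates):
    """Group candidates by feedback pattern (at most 3**5 = 243 buckets)."""
    return Counter(feedback(guess, c) for c in candidates)

def expected_remaining(guess, candidates):
    """Average number of candidates left after the guess (lower is better)."""
    sizes = bucket_sizes(guess, candidates).values()
    return sum(n * n for n in sizes) / len(candidates)

# Toy candidate list for illustration only.
candidates = ["raise", "arise", "rinse", "slate"]
best = min(candidates, key=lambda w: expected_remaining(w, candidates))
```

A guess that spreads the candidates over many small buckets leaves less work for the second guess than one that dumps most of them into a single bucket.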
I would say that the distribution of yellow letters in the remaining word determines the value. If you have a yellow "X" or "Q" or one yellow vowel after testing all 5 vowels, that is a lot more meaningful than a yellow "T", which could really go anywhere. A yellow "G" but grey "I" tells me (at least intuitively, not quantitatively) a bit more than a green "I" and grey "G".
A recent word had a single vowel. This made it reasonably easy to guess where that vowel could go and to also choose a consonant-heavy next word that would really tear apart the search space if even one consonant was valid.
I think I'll switch to using RAISE or SLATE as a first choice. It was ADIEU before, but there's almost never an A in slot 1, U in slot 5, or D in slot 2, and I don't feel like any of these letters has a significantly most common position with just the info from these 5 letters. The yellow/grey letters are helpful after ADIEU, but I still need to explore at least 4 more letters (including O, and usually N or C) to even get a good idea of where any matches might go. In the moments that letter 4 is "E" while guessing ADIEU on round 1, that actually doesn't help me much with round 2. In fact, then I'm faced with the decision to explore unguessed letters in general, words with two "E"s, and/or words with "E" in 4th position. (same with "I" in 3rd)
I've implemented your last suggestion of scoring words by how much they reduce the pool of potential candidates (12k+). Testing it against the 1,000 most frequent words (targets), the word LORES seems to work best on average.
Interesting. I've been using ORATE as it knocks out a lot of the top-frequency letters in the English language as a whole. But I guess the distribution in five-letter words, and more so in a subset of five-letter words, may skew differently.
From what I saw, T shifts down noticeably when restricted to exactly 5 letters. Intuitively, it sort of makes sense -- there are a ton of short T words ('that' and friends).
> Beginning with the full list of 5-letter words, I calculated the frequency of each letter of the alphabet in each of the 5 possible positions for a 5 letter word.
There’s a great many 5-letter words that the creator will never consider because they’re too obscure. Treating all 5-letter words as possible is a mistake when calculating strategy for this.
High-frequency letters will be over-represented in your analysis.
Nice. I tried a simpler method of only looking at the Wikipedia table of overall letter frequency in dictionaries (not in 5-letter words) and the vowels are high, especially E and I followed by A. Then I checked the frequency of starting letters, and that is led by "S". So my guess for a 5-letter word became one with at least two vowels from E, I, A, that starts with an S, and has the remaining two characters from the reasonably high-frequency set (R, T, N, etc). My guess then was SIREN, which has that property. Just out of curiosity, if you still have the output from your program, how bad was SIREN compared to SLATE?
That is a really interesting approach -- if you remove the words with any of the letters in SLATE, what is the highest ranked word? (If you have it to hand)
> Case in point: If you get 3 green 2 yellow in the first word, you can solve on the next guess.
That's a very specific edge case, though.
I agree with you that the second word should vary depending on the first result, but picking the best first and second words requires a lot of analysis. The first word should be picked to open up the best options for the second word, and while I believe the second word should not contain letters from the first unless you got extremely lucky, the distribution of letters in the remaining available words probably changes which letters you want to cover.
Failing to do the required analysis, my current strategy is to pick two words that cover the 10 most common letters in English, ETAOINSHRD. Sometimes it's "ethos" and "nadir", sometimes "thine" and "roads", etc. So far, it's worked well.
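If you want to find more pairs like these, a quick check is easy to write. This is only a sketch: `covers` and `covering_pairs` are names I made up, and the word list is a toy one.

```python
from itertools import combinations

def covers(words, letters="etaoinshrd"):
    """True if the words together use every letter in `letters`."""
    return set(letters) <= set("".join(words))

def covering_pairs(word_list, letters="etaoinshrd"):
    """Yield every pair of words from word_list that covers `letters`."""
    for first, second in combinations(word_list, 2):
        if covers([first, second], letters):
            yield first, second

# Toy list for illustration only.
word_list = ["ethos", "nadir", "thine", "roads", "slate"]
pairs = list(covering_pairs(word_list))
print(pairs)
```

Since two 5-letter words have exactly 10 letters between them, any pair that covers all of ETAOINSHRD also has no repeated letters.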
> It seems clear to me that there is an optimal starting word, but that the best second word has to depend on the info you gain from the first.
Definitely. The likelihood of a letter appearing in a given place changes depending on the letters around it. A Q will almost always be followed with a U, for example.
I wrote a script yesterday which spits out the relative probabilities of possible letters in each unknown position, given the current known/excluded letters -- it was interesting to see the effect in action.
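Not the actual script, but a minimal sketch of the idea, assuming a pattern string with `.` for unknown positions, a set of excluded (grey) letters, and a set of letters known to be present somewhere. It ignores the extra positional information a yellow tile gives you.

```python
from collections import Counter

def matches(word, pattern, excluded, required):
    """Check a word against known letters, grey letters, and present letters."""
    # Known (green) positions must match exactly.
    if any(p != "." and p != ch for p, ch in zip(pattern, word)):
        return False
    # Excluded letters may not appear in any unknown position.
    if any(p == "." and ch in excluded for p, ch in zip(pattern, word)):
        return False
    # Letters known to be present must appear somewhere.
    return all(ch in word for ch in required)

def letter_probabilities(words, pattern, excluded="", required=""):
    """For each unknown position, the share of remaining words with each letter."""
    remaining = [w for w in words if matches(w, pattern, excluded, required)]
    probs = {}
    for i, p in enumerate(pattern):
        if p != ".":
            continue
        counts = Counter(w[i] for w in remaining)
        probs[i] = {ch: n / len(remaining) for ch, n in counts.most_common()}
    return probs

# Toy list for illustration only.
words = ["quiet", "quite", "quilt", "suite", "guilt"]
probs = letter_probabilities(words, "q....")
print(probs[1])
```

With a known Q in the first slot, every remaining word in this toy list has a U in the second slot, which is the Q-then-U effect mentioned above.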