INDEX
Explanations
words related to issues, problems, or potential threats
occurrences of the word "were."
Negative Logits
dom      -0.65
Matter   -0.65
iates    -0.64
Defeat   -0.62
Raise    -0.62
otic     -0.62
ledge    -0.61
oire     -0.61
place    -0.58
lag      -0.58
Positive Logits
wolves        1.53
wolf          1.29
able          0.97
nt            0.96
supposed      0.88
instrumental  0.87
hes           0.85
originally    0.85
hers          0.84
greeted       0.83
Activations Density 0.223%