INDEX
Explanations
phrases related to strength and weakness
words and phrases related to perceptions of weakness and strength
Negative Logits
mentioned    -0.67
tions        -0.65
iland        -0.62
tails        -0.59
=#           -0.59
ancies       -0.56
anooga       -0.56
sequently    -0.55
undo         -0.54
arton        -0.53
Positive Logits
underdog       0.62
savior         0.61
rog            0.60
inferior       0.60
discipl        0.60
chwitz         0.57
".[            0.57
pic            0.56
coward         0.56
utilitarian    0.56
Activations Density 0.738%