Explanations
phrases indicating a contrast to commonly held beliefs or assertions
Negative Logits
Token        Logit
ilitating    -0.83
lov          -0.79
oler         -0.78
killer       -0.76
isol         -0.75
urated       -0.75
beans        -0.75
aqu          -0.74
ubes         -0.74
aquin        -0.73
Positive Logits
Token        Logit
viewpoints   1.05
sides        0.89
opinions     0.88
counsel      0.81
sexes        0.79
shore        0.79
views        0.78
side         0.77
halves       0.77
minded       0.77
Activation Density: 0.054%