INDEX
Explanations
phrases related to interactions with authority figures
phrases relating to balance and decision-making
Negative Logits
Token           Logit
respectively    -0.83
}.              -0.81
?).             -0.73
`.              -0.72
)).             -0.72
*.              -0.71
.).             -0.70
'.              -0.68
$.              -0.68
+.              -0.67
Positive Logits
Token           Logit
his             0.55
himself         0.54
wheelchair      0.48
resignation     0.48
apologise       0.47
cohol           0.47
Leeds           0.46
anus            0.46
virginity       0.46
composure       0.45
Activations Density: 1.939%
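
As a rough illustration of what a density figure like this could mean, the sketch below assumes that activation density is the fraction of sampled token positions on which the feature's activation exceeds a threshold (zero here). That definition, the function name `activation_density`, and the toy data are assumptions for illustration, not something this page states.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds `threshold`.

    Assumed definition: the page does not state how its density is computed.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 1000 token positions, 19 of which fire.
toy = [0.0] * 981 + [0.7] * 19
print(f"{activation_density(toy):.3%}")
```

Under this reading, a density of 1.939% would mean the feature fires on roughly 19 of every 1,000 sampled tokens.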