Explanations
words related to personal judgment or discretion
concepts related to personal autonomy and decision-making
Negative Logits
ayn        -0.65
Sov        -0.59
referen    -0.57
ILA        -0.55
MSN        -0.54
yah        -0.54
mbuds      -0.53
corrid     -0.53
Ther       -0.52
MAP        -0.52
Positive Logits
;)            0.89
persuasion    0.87
:)            0.79
!.            0.78
imaginable    0.78
_.            0.76
drive         0.73
:-)           0.72
*.            0.72
.<            0.70
Activations Density: 0.442%