INDEX
Explanations
statements related to legal or moral implications and consequences
phrases indicating reasonable beliefs related to danger and injury
Negative Logits
DK          -0.58
Patreon     -0.56
ahime       -0.56
MK          -0.55
iets        -0.55
MSN         -0.55
Knight      -0.54
Jed         -0.54
Dialogue    -0.54
emonium     -0.54
Positive Logits
).[         0.76
Downloadha  0.73
)).         0.72
harm        0.69
").         0.68
)."         0.67
biological  0.62
or          0.62
azo         0.61
detriment   0.61
Activations Density: 2.096%