INDEX
Explanations
instances of actions or statements that could be seen as harmful or disruptive
phrases that convey action or inquiry
Negative Logits
nces          -0.75
abouts        -0.71
sequent       -0.71
etheless      -0.70
webkit        -0.68
Keys          -0.68
thereafter    -0.66
tions         -0.66
iann          -0.65
none          -0.62
Positive Logits
wrong            0.93
Wrong            0.84
delusional       0.81
something        0.80
mischief         0.80
miscon           0.78
hypocr           0.74
hay              0.73
backwards        0.72
misunderstood    0.72
Activations Density 0.471%