INDEX
Explanations
verbal commands or instructions starting with "don't."
negations or expressions of inability
Negative Logits
behavi              -0.80
tremend             -0.77
EStream             -0.76
mosqu               -0.75
ãĤµãĥ¼ãĥĨãĤ£ãĥ¯ãĥ³  -0.73
gorilla             -0.69
intern              -0.69
cannabin            -0.68
exha                -0.66
Skydragon           -0.66
Positive Logits
ween                1.07
aken                1.06
otally              1.04
asks                1.04
ruck                1.03
ractor              1.02
olkien              1.01
akers               1.00
ople                0.97
une                 0.97
Activations Density 0.127%
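
The density figure above can be read as the fraction of sampled tokens on which the feature fires. The page does not state how the value is computed, so the following is only a minimal sketch under that assumption; the function name, the zero threshold, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens with a nonzero
    (here: above-threshold) activation. The source does not confirm this.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, of which 13 activate the feature.
acts = [0.0] * 9987 + [1.5] * 13
density = activation_density(acts)
print(f"{density:.3%}")  # formatted as a percentage, like the figure above
```

Under this reading, a density of 0.127% would correspond to the feature firing on roughly 1 in 800 tokens.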