INDEX
Explanations
the word "nothing" followed by high activations
negative assertions and phrases emphasizing nullity or insignificance
Negative Logits
Token           Logit
PLA             -0.67
landers         -0.63
uctions         -0.59
eus             -0.59
Bots            -0.58
transitions     -0.58
decline         -0.57
eton            -0.57
prohibitions    -0.57
downs           -0.57
Positive Logits
Token           Logit
lled            0.86
avering         0.75
bered           0.74
umbn            0.73
arily           0.73
ient            0.72
ĸļ              0.69
akin            0.69
itter           0.68
ozyg            0.68
Activations Density: 0.100%