Explanations
descriptions of things as "nice"
the repeated use of the word "nice."
Negative Logits
Token       Logit
arians      -0.79
ogens       -0.79
authorized  -0.79
arers       -0.76
arian       -0.75
ochond      -0.75
inant       -0.74
rained      -0.74
uilding     -0.71
igate       -0.69
Positive Logits
Token       Logit
nice        0.92
bye         0.86
fluffy      0.81
bye         0.80
touches     0.78
additions   0.78
little      0.75
enough      0.75
neat        0.74
bonus       0.74
Activations Density 0.019%
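
As a rough illustration of what a density figure like this means, the sketch below assumes that activation density is the fraction of tokens on which the feature's activation is above zero. That definition, the function name `activation_density`, and the sample data are assumptions made for illustration and are not taken from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumed definition: the dashboard does not state how density is computed.
    """
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Hypothetical data: 100,000 tokens, of which 19 activate the feature.
acts = [0.0] * 99_981 + [1.5] * 19
density = activation_density(acts)
print(f"{density:.3%}")
```

Under this reading, a density of 0.019% corresponds to the feature firing on roughly 19 out of every 100,000 tokens.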