INDEX
Explanations
expressions of relief or satisfaction
Negative Logits
ciplinary    -0.77
pes          -0.75
akable       -0.74
cend         -0.68
cffffcc      -0.65
abre         -0.64
cum          -0.63
effic        -0.62
artifacts    -0.62
helicop      -0.62
Positive Logits
they         0.74
somebody     0.70
tid          0.69
sonian       0.67
othe         0.65
terday       0.65
imaru        0.64
you          0.64
THEY         0.63
someone      0.63
Activations Density 0.081%