Explanations
instances of emotional or expressive language
Negative Logits
Porno       -0.19
ubo         -0.18
ventus      -0.15
ceries      -0.15
iterr       -0.14
hots        -0.14
inness      -0.14
.Fat        -0.14
adiens      -0.14
.useState   -0.14
Positive Logits
con         0.14
MPI         0.14
            0.14
meaning     0.14
meaning     0.14
argent      0.13
_vocab      0.13
Cobb        0.13
tying       0.13
legit       0.13
Activations Density 0.008%