Explanations
descriptions of situations or actions
phrases relating to control and power dynamics
Negative Logits
Token     Logit
âĢ        -0.88
20439     -0.87
âĢ        -0.87
\"        -0.87
âĹ        -0.82
ãĢ        -0.82
âĢIJ      -0.81
¨         -0.80
.","      -0.80
"},"      -0.80
Positive Logits
Token     Logit
stuff     1.14
goddamn   1.13
dudes     1.12
kinda     1.11
weird     1.08
pretty    1.08
dude      1.07
shit      1.07
crap      1.06
shitty    1.06
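Several negative-logit tokens above (`âĢ`, `âĹ`, `ãĢ`, `¨`) look garbled, but they are most likely byte-level BPE token strings. A GPT-2-style tokenizer displays each raw byte as a printable character, and these tokens are fragments of multi-byte UTF-8 characters. The sketch below assumes the feature's model uses the GPT-2 byte-to-unicode mapping; the dashboard does not name the tokenizer. It converts the displayed strings back to raw bytes. The helper `token_to_bytes` is hypothetical. The last example also assumes that `âĢIJ` was originally `âĢĲ`, with the single character Ĳ (U+0132) split into `IJ` during text extraction.

```python
# Sketch assuming a GPT-2-style byte-level BPE tokenizer (not confirmed by the dashboard).

def bytes_to_unicode() -> dict[int, str]:
    """Map each byte 0-255 to the printable character a GPT-2-style tokenizer displays for it."""
    # Bytes that are already printable keep their own code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    # Every remaining byte is shifted to code point 256 + n, in ascending byte order.
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


# Inverse table: displayed character -> raw byte.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def token_to_bytes(token: str) -> bytes:
    """Recover the raw bytes behind a displayed byte-level token string (hypothetical helper)."""
    return bytes(CHAR_TO_BYTE[ch] for ch in token)


# "âĢ\u0132" assumes the extracted "âĢIJ" was originally "âĢĲ".
for tok in ["âĢ", "âĹ", "ãĢ", "¨", "âĢ\u0132"]:
    raw = token_to_bytes(tok)
    try:
        text = repr(raw.decode("utf-8"))
    except UnicodeDecodeError:
        text = "<not valid UTF-8 on its own>"
    print(repr(tok), raw.hex(" "), text)
```

Under these assumptions, the first four tokens map to `e2 80`, `e2 97`, `e3 80` and `a8`. These are truncated or stray UTF-8 sequences that cannot be decoded alone. The fifth maps to `e2 80 90`, which decodes to U+2010 HYPHEN.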
Activation density: 1.628%
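Activation density is usually read as the fraction of token positions on which the feature fires, meaning its activation is above zero, across whatever sample the dashboard evaluated. The dashboard states neither the threshold nor the sample, so both are assumptions here. Under that reading, 1.628% means the feature is active on roughly 16 of every 1,000 tokens. A minimal sketch, using a hypothetical `activation_density` helper and toy activations:

```python
import numpy as np


def activation_density(acts, threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation exceeds `threshold` (assumed definition)."""
    acts = np.asarray(acts)
    return float((acts > threshold).mean())


# Toy per-token activations for one feature (made-up values for illustration).
acts = np.array([0.0, 0.0, 3.2, 0.0, 0.0, 0.0, 0.0, 1.1, 0.0, 0.0])
print(f"{activation_density(acts):.3%}")  # 2 of 10 positions fire -> 20.000%
```

A real dashboard would compute this over a much larger token sample. It may also use a small positive threshold rather than zero.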