Explanations
greetings and friendly openers
NEGATIVE LOGITS
axiomatic     0.44
blame         0.40
Thoreau       0.40
alcoholism    0.39
事實          0.37
dementia      0.37
nonsense      0.37
egregious     0.36
Frankly       0.36
idlertid      0.36
POSITIVE LOGITS
awesome       0.71
Awesome       0.59
Hey           0.57
Awesome       0.53
hey           0.51
awesome       0.51
めっちゃ      0.50
hi            0.49
Hi            0.49
hola          0.48
Activations Density: 0.001%
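
The activation density figure above can be read as the share of sampled token positions on which the feature fires. The sketch below illustrates that reading; it assumes density means "fraction of positions with activation above zero", which the page itself does not state. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and used only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature fires.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: the feature fires on 1 of 100,000 sampled token positions.
sample = [0.0] * 99_999 + [3.2]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.001%
```

Under this reading, a density of 0.001% corresponds to roughly one firing position per 100,000 tokens, which would fit a feature that responds only to a narrow class of inputs such as greetings.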