INDEX
Explanations
phrases indicating actions or progress towards goals
Negative Logits
awe       -0.18
arring    -0.15
.ast      -0.15
enville   -0.15
olley     -0.15
ocache    -0.15
ocytes    -0.15
olls      -0.14
portun    -0.14
contrast  -0.14
POSITIVE LOGITS
BV        0.15
sey       0.14
ëĵĿ       0.14
asca      0.14
ãģĹãģĭ    0.14
PLAIN     0.14
dam       0.14
YNC       0.14
Unsafe    0.13
PLICIT    0.13
Activations Density 0.037%