Explanations
phrases indicating a type or category, often associated with subjective descriptions
Negative Logits
chg          -0.17
ерш          -0.17
utter        -0.16
imizer       -0.16
uat          -0.14
POSSIBILITY  -0.14
imuth        -0.14
тю           -0.14
remely       -0.14
distract     -0.13
Positive Logits
like          0.21
-sort         0.21
semi          0.16
antity        0.16
Like          0.16
thing         0.16
-таки         0.15
tw            0.15
LIKE          0.15
/s            0.15
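The two lists rank vocabulary tokens by how strongly the feature pushes their logits down or up. Below is a minimal sketch of how such lists could be produced. It assumes, as is common for sparse-autoencoder feature dashboards though this page does not say so, that each score is the dot product of the feature's decoder direction with one column of the model's unembedding matrix. The names `decoder_dir`, `W_U`, `logit_effects`, `top_and_bottom` and all numbers are hypothetical toy values.

```python
# Sketch only: ASSUMES logit score = decoder direction . unembedding column.
# All names and numbers below are hypothetical toy values.

def logit_effects(decoder_dir, W_U):
    """Project a feature's decoder direction through the unembedding.

    decoder_dir: list of d_model floats (the feature's output direction).
    W_U: d_model rows, each a list of vocab_size floats.
    Returns one logit contribution per vocabulary token.
    """
    vocab_size = len(W_U[0])
    return [
        sum(decoder_dir[i] * W_U[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]


def top_and_bottom(effects, vocab, k):
    """Return the k most positive and the k most negative (token, score) pairs."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return positive, negative


# Toy example: d_model = 3, vocabulary of 4 tokens.
vocab = ["like", "sort", "utter", "chg"]
decoder_dir = [1.0, 0.5, -0.5]
W_U = [
    [0.2, 0.1, -0.1, 0.0],
    [0.0, 0.2, 0.0, -0.2],
    [-0.1, 0.0, 0.2, 0.1],
]

positive, negative = top_and_bottom(logit_effects(decoder_dir, W_U), vocab, k=2)
print("positive:", [tok for tok, _ in positive])  # ['like', 'sort']
print("negative:", [tok for tok, _ in negative])  # ['utter', 'chg']
```

Sorting once and slicing both ends keeps the positive and negative lists consistent with each other; a real vocabulary would use a matrix library rather than nested lists.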
Activations Density 0.031%
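The density figure is the share of token positions on which the feature fires. A minimal sketch, assuming "fires" means an activation strictly greater than zero over some sampled set of tokens (the page states neither the threshold nor the sample size); `activation_density` is a hypothetical helper.

```python
# Sketch only: ASSUMES "active" means activation > threshold (default 0)
# on a sampled set of token positions.

def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose feature activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Toy example: the feature fires on 31 of 100,000 sampled tokens.
acts = [0.0] * 99_969 + [1.3] * 31
print(f"{activation_density(acts):.3%}")  # 0.031%
```

At 0.031%, the feature is active on roughly 31 of every 100,000 tokens, i.e. it is a very sparse feature.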