Explanations
phrases indicating interaction or engagement with others
Negative Logits
PIO         -0.20
urat        -0.17
.Uri        -0.17
ãĤ¤ãĤº      -0.16
.sul        -0.16
amework     -0.16
견          -0.16
lico        -0.16
lon         -0.15
getc        -0.15
Positive Logits
A           0.15
EC          0.15
ror         0.15
ue          0.15
noticed     0.15
els         0.14
409         0.14
ustr        0.14
obj         0.14
HQ          0.14
Activations Density: 0.025%