Explanations
words related to feelings of discomfort or negative experiences
Negative Logits
bower         -0.17
/bower        -0.17
illac         -0.17
çŃĨ           -0.16
stor          -0.16
STYLE         -0.16
ATAR          -0.15
ulp           -0.15
firm          -0.15
ãĥ¡ãĥ³ãĥĪ     -0.15
Positive Logits
/on           0.17
w             0.17
VStack        0.17
Leading       0.16
Sanity        0.16
jet           0.16
ippi          0.14
pr            0.14
Jet           0.14
Shaw          0.14
Activations Density 0.025%
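
The page does not define how the density figure is computed. A minimal sketch, assuming it is the fraction of token positions on which the feature's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and only illustrate that reading:

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means (active positions) / (all positions).
    """
    total = 0
    active = 0
    for value in activations:
        total += 1
        if value > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations given")
    return active / total


# Hypothetical data: the feature fires on 1 of 4000 token positions.
sample = [0.0] * 3999 + [1.3]
print(f"{activation_density(sample):.3%}")  # prints 0.025%
```

Under this reading, a density of 0.025% corresponds to the feature firing on roughly one token in every 4,000.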