INDEX
Explanations
words related to negative outcomes or distressing experiences
Negative Logits
inoa         -0.74
Iw           -0.74
taboola      -0.73
Tsukuyomi    -0.70
Debor        -0.68
eva          -0.68
Ezek         -0.67
Niet         -0.66
Sonia        -0.65
âĵĺ          -0.64
Positive Logits
icultural    0.81
aday         0.78
quarters     0.73
angs         0.70
lights       0.70
abee         0.66
sters        0.66
dog          0.66
acan         0.65
ogeneous     0.65
Activations Density: 0.033%