INDEX
Explanations
words related to criticism and negative consequences
Negative Logits
inarily    -0.69
craft      -0.66
icipated   -0.64
riots      -0.64
issance    -0.64
76561      -0.63
hyde       -0.62
ords       -0.62
igi        -0.60
1886       -0.60
Positive Logits
much          0.89
busy          0.84
far           0.83
tempting      0.79
risky         0.79
many          0.77
afraid        0.77
oths          0.76
distracting   0.74
ls            0.72
Activations Density 0.326%