INDEX
Explanations
phrases related to responsibility or critique towards particular individuals
Negative Logits
urations        -0.60
Esp             -0.60
Bron            -0.58
orth            -0.58
ibaba           -0.58
Membership      -0.57
Shutterstock    -0.57
Trinity         -0.56
tesque          -0.56
Louis           -0.55
Positive Logits
responsible     0.89
liest           0.84
deciding        0.82
happiest        0.79
who             0.73
abet            0.72
responsible     0.72
reacting        0.71
initiating      0.69
risking         0.69
Activations Density: 0.098%