INDEX
Explanations
phrases associated with violent or conflict-related events or contexts
words related to identities and categories
Negative Logits
enegger    -0.84
elf        -0.82
raised     -0.79
orate      -0.73
oken       -0.71
raising    -0.68
CHA        -0.67
irth       -0.64
ioned      -0.63
Broken     -0.62
Positive Logits
idal       1.43
pend       0.76
ãĥ³ãĤ¸     0.69
ãĥ¥        0.69
ity        0.67
ysis       0.67
atory      0.66
oad        0.66
itous      0.65
ITIES      0.64
Activations Density 0.010%
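
Two of the positive-logit tokens above are printed as byte-level vocabulary strings rather than readable text. The sketch below shows one way to turn them back into characters, assuming the underlying tokenizer uses the GPT-2 style byte-to-unicode mapping (the source does not name the model, so this is an assumption). The helper names `gpt2_byte_decoder` and `decode_token` are hypothetical and exist only for this illustration.

```python
def gpt2_byte_decoder():
    """Build the inverse of the GPT-2 style byte-to-unicode table.

    Assumption: the dashboard's tokenizer uses this byte-level scheme.
    """
    # Bytes that the scheme maps to the character with the same code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    mapping = {chr(b): b for b in keep}
    # Every remaining byte is shifted to code point 256 + n, in byte order.
    n = 0
    for b in range(256):
        if b not in keep:
            mapping[chr(256 + n)] = b
            n += 1
    return mapping


def decode_token(token, table=None):
    """Convert a byte-level vocabulary string into readable text."""
    table = table or gpt2_byte_decoder()
    raw = bytes(table[ch] for ch in token)
    # A single token may end in the middle of a multi-byte character,
    # so undecodable bytes are replaced instead of raising an error.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ãĥ³ãĤ¸", "ãĥ¥", "idal"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under that assumption, the two non-ASCII entries decode to the katakana fragments ンジ and ュ, while plain ASCII tokens such as `idal` pass through unchanged.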