INDEX
Explanations
references to actions or statuses regarding people and their societal roles
Negative Logits
ackbar     -0.18
atform     -0.16
icans      -0.14
deaux      -0.14
ulaire     -0.14
olumbia    -0.14
olis       -0.14
dio        -0.14
erç        -0.14
ccione     -0.14
Positive Logits
indeed     0.15
ãi         0.15
Sharing    0.15
ää         0.15
uela       0.15
sharing    0.15
sharing    0.14
ulle       0.14
isch       0.14
only       0.14
Activations Density 0.011%