INDEX
Explanations
concepts related to social justice and equity issues, particularly around fairness and inequality
Negative Logits
away          -0.18
onas          -0.17
alem          -0.16
NotAllowed    -0.15
Learned       -0.14
spoof         -0.14
ιÏĩ           -0.14
ertia         -0.14
Ulus          -0.14
ippy          -0.13
Positive Logits
fail          0.39
fails         0.38
overlook      0.37
miss          0.36
misses        0.36
neglect       0.33
miss          0.31
ignore        0.30
ignores       0.29
masks         0.28
Activations Density 0.312%