Explanations
phrases related to fairness and justice
Negative Logits
Token        Logit
apse         -0.86
CHAT         -0.83
hent         -0.76
OPLE         -0.74
Assembly     -0.68
uality       -0.68
acid         -0.65
OUS          -0.63
artifacts    -0.62
hal          -0.61
Positive Logits
Token          Logit
yt             1.15
grounds        1.07
fair           1.02
itably         0.88
iciary         0.87
ground         0.85
compensation   0.78
child          0.77
trade          0.75
fair           0.73
Activations Density: 0.642%