INDEX
Explanations
fair wages, compensation, or admissions
Negative Logits
忘记            0.46
evoking         0.45
晓              0.45
ignore          0.45
corrobor        0.43
компании        0.43
忽略            0.43
общества        0.43
boxylate        0.43
imaginations    0.42
Positive Logits
fairness        0.69
Fairness        0.57
fairer          0.55
fair            0.50
Fair            0.47
fair            0.44
tax             0.44
unfair          0.43
tema            0.42
tema            0.42
Activations Density 0.010%