INDEX
Explanations
phrases indicating the concept of reasonableness or fairness
Negative Logits
icket      -0.73
chet       -0.70
chu        -0.70
planes     -0.69
yi         -0.69
frey       -0.69
berries    -0.68
stals      -0.65
cart       -0.64
flower     -0.63
Positive Logits
tarian         0.98
acies          0.82
inference      0.76
excuse         0.74
precaution     0.72
expectation    0.71
soType         0.70
ufact          0.70
justification  0.70
itably         0.70
Activations Density: 0.959%