INDEX
Explanations
proper nouns related to people
names or terms associated with specific individuals or entities
Negative Logits
feder     -0.80
ACTIONS   -0.73
é¾įå      -0.70
pse       -0.64
UTERS     -0.64
akeru     -0.60
AUD       -0.59
ĨĴ        -0.59
Helpful   -0.59
theless   -0.58
Positive Logits
aten      0.78
iman      0.74
edi       0.73
nen       0.70
azi       0.67
ati       0.66
olean     0.65
puff      0.65
angs      0.64
har       0.64
Activations Density 0.281%