INDEX
Explanations
language that indicates evidence and support in arguments or claims
Negative Logits
adle     -0.17
uck      -0.16
atin     -0.15
ives     -0.15
Claud    -0.15
Sims     -0.14
plat     -0.14
erty     -0.14
otty     -0.14
Barth    -0.14
Positive Logits
ocache   0.16
andest   0.14
alu      0.14
emmel    0.14
olest    0.14
Unc      0.14
جÙĪ      0.14
cola     0.14
çĮ       0.14
ç        0.14
Activations Density 0.458%