INDEX
Explanations
phrases describing newly acquired knowledge or information
expressions of acquired knowledge or information
Negative Logits
abwe      -0.74
oided     -0.71
throats   -0.65
pers      -0.64
idity     -0.64
adies     -0.62
ankind    -0.62
ataka     -0.61
yip       -0.58
vell      -0.58
POSITIVE LOGITS
llor         0.90
Lear         0.81
ilage        0.79
lyn          0.77
çīĪ          0.75
æ©           0.75
firsthand    0.75
æĥ           0.73
CLASSIFIED   0.73
learn        0.72
Activations Density 0.028%
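
The density figure can be read as the share of sampled token positions on which this feature fires. The sketch below illustrates that reading only; it assumes "density" means the percentage of tokens with a nonzero activation, which this page does not state. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds threshold.

    Assumption: "density" is the fraction of sampled tokens on which the
    feature is active, expressed as a percentage.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 100,000 token positions, 28 of which activate the feature.
sample = [0.0] * 99_972 + [1.0] * 28
print(f"{activation_density(sample):.3f}%")  # prints 0.028%
```

Under this reading, a density of 0.028% corresponds to roughly 28 active positions per 100,000 tokens, so the feature is sparse.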