INDEX
Explanations
phrases indicating familiarity or awareness of information
Negative Logits
mina     -0.16
adolu    -0.15
oplan    -0.15
idle     -0.15
itas     -0.14
зв       -0.14
ddit     -0.14
Cout     -0.14
.tc      -0.14
хв       -0.13
Positive Logits
ington   0.15
Від      0.15
ie       0.15
INGTON   0.14
IFO      0.14
hi       0.14
.Ac      0.14
edores   0.14
reck     0.14
kind     0.14
Activations Density 0.021%