INDEX
Explanations
terms and phrases related to deception or manipulation
Negative Logits
adapta        -0.72
inspira       -0.67
сделали       -0.65
нашли         -0.63
делают        -0.62
orienta       -0.62
coinciden     -0.62
representa    -0.61
interpreta    -0.61
combina       -0.60
Positive Logits
poffe            0.85
raiſ             0.80
MethodManager    0.74
atsi             0.73
deſt             0.72
AndEndTag        0.70
%";              0.69
cknow            0.68
etzal            0.68
herum            0.67
Activations Density: 1.154%