Explanations
phrases indicating past usage and engagement in activities
Negative Logits
Token            Logit
featureID        -0.60
énario           -0.55
reszcie          -0.52
Портали          -0.51
HasAnnotation    -0.49
Predecesor       -0.46
üstü             -0.44
désolés          -0.44
protoimpl        -0.43
got              -0.43
Positive Logits
Token            Logit
autorytatywna    0.44
Autorizaciones   0.39
OGND             0.38
fören            0.38
feroit           0.38
singoli          0.35
SharedCtor       0.34
Constitu         0.34
<bos>            0.33
Hochspringen     0.33
Activations Density: 0.152%