INDEX
Explanations
references to similarity or sameness
Negative Logits
rawler     -0.14
aro        -0.13
asures     -0.13
ilation    -0.12
ืà¸Ń        -0.12
ILA        -0.12
Lap        -0.12
izable     -0.12
finally    -0.12
orna       -0.12
Positive Logits
same       0.84
same       0.78
Same       0.67
Same       0.65
SAME       0.62
åIJĮ        0.60
_same      0.59
SAME       0.59
缸åIJĮ       0.57
mismo      0.56
Activations Density 0.140%