INDEX
Explanations
references to specific groups or categories
Negative Logits
itſelf          -0.95
ſelf            -0.81
Theſe           -0.80
purpoſe         -0.79
ItemBackground  -0.79
variés          -0.78
AccessorTable   -0.78
ſeveral         -0.75
Monfieur        -0.74
的其他          -0.74
Positive Logits
two             0.76
two             0.68
تين             0.63
respectively    0.61
Two             0.60
Two             0.58
beiden          0.57
+#+#            0.54
respectively    0.53
deux            0.53
Activations Density 0.680%