Explanations
a frequent term or concept, specifically related to significant individuals or noteworthy events
Negative Logits
Token    Logit
vlo      -0.61
Rule     -0.58
ne       -0.57
._       -0.54
poco     -0.53
.        -0.53
<eos>    -0.51
so       -0.51
kirch    -0.51
Rule     -0.51
Positive Logits
Token         Logit
AddTagHelper  1.09
Efq           0.99
)");          0.94
itſelf        0.94
raiſ          0.93
Jefus         0.93
becauſe       0.92
>");          0.90
)";           0.89
$.            0.89
Activations Density: 0.111%