INDEX
Explanations
conjunctions and words indicating connections or relationships
Negative Logits
19      -0.06
29      -0.06
55      -0.06
vy      -0.06
48      -0.06
hakk    -0.06
Hans    -0.06
Hakk    -0.06
54      -0.06
53      -0.06
Positive Logits
full    0.07
full    0.07
eh      0.07
isté    0.06
rost    0.06
ew      0.06
ãĥ³ãĥĸ  0.06
erra    0.06
veral   0.06
orian   0.06
Activations Density 0.018%
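
The page does not define the density figure. A common reading is the fraction of sampled token positions on which the feature fires at all. The sketch below works under that assumption; the function name `activation_density`, the zero threshold, and the sample data are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active. The source page does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 18 of 100,000 token positions.
sample = [1.0] * 18 + [0.0] * 99_982
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.018%
```

Under this reading, a density of 0.018% would mean the feature is active on roughly 18 out of every 100,000 tokens, which is sparse.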