Explanations
words related to emphasizing a conclusion or consequence
Negative Logits
Defenders    -0.71
Polo         -0.68
igger        -0.64
metro        -0.59
ridges       -0.56
Ones         -0.56
Klu          -0.55
Beaver       -0.55
abies        -0.55
steroids     -0.55
Positive Logits
forth        1.27
entimes      0.96
far          0.94
forward      0.91
ly           0.89
far          0.84
mask         0.76
lessly       0.74
othe         0.71
神           0.70
Activations Density 0.021%