Explanations
instances of the word "we" in various contexts
Negative Logits
a           -0.81
')],        -0.81
the         -0.79
"           -0.77
ICAGO       -0.76
Griffin     -0.75
)")         -0.74
podjela     -0.73
"):         -0.73
)");        -0.72
Positive Logits
Thebes      0.84
trypto      0.70
Jof         0.69
enfans      0.67
houſe       0.66
Howe        0.65
Canva       0.65
Holocene    0.65
heretics    0.64
Hereford    0.64
Activations Density: 0.685%