INDEX
Explanations
questions introduced with the word "What"
questions formulated with "What"
Negative Logits
Token      Logit
shore      -0.66
lich       -0.65
general    -0.64
roads      -0.64
ulic       -0.63
Lago       -0.63
ability    -0.63
println    -0.62
gi         -0.60
raction    -0.59
Positive Logits
Token          Logit
soever         1.24
happens        1.04
Lies           0.98
happened       0.94
distinguishes  0.94
transpired     0.92
Makes          0.90
separates      0.88
Difference     0.85
happ           0.83
Activations Density: 0.081%