Re: [PATCH] acpi: Avoid dropping rapid hotkey events (or other GPEs) on Asus EeePC

Previous thread: [PATCH 2/7] dynamic debug v2 - nfs conversion by Jason Baron on Tuesday, July 15, 2008 - 5:32 pm. (9 messages)

Next thread: [patch 00/17] Tracepoints v4 for linux-next by Mathieu Desnoyers on Tuesday, July 15, 2008 - 6:26 pm. (3 messages)
To: <astarikovskiy@...>
Cc: <linux-acpi@...>, <linux-kernel@...>
Date: Tuesday, July 15, 2008 - 6:25 pm

It looks like this EC clears the SMI_EVT bit after every query, even if there
are more events pending. The workaround is to repeatedly query the EC until
it reports that no events remain.

This fixes a regression in 2.6.26 (from 2.6.25.3). Initially reported as
"Asus Eee PC hotkeys stop working if pressed quickly" in bugzilla
<http://bugzilla.kernel.org/show_bug.cgi?id=11089>.

The regression was caused by a recently added check for interrupt storms.
The Eee PC triggers this check and switches to polling. When multiple events
arrive between polling intervals, only one is fetched from the EC. This causes
erroneous behaviour; ultimately events stop being delivered altogether when the
EC buffer overflows.

Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk>

---
diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
index 5622aee..2b4c5a2 100644
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -459,14 +459,10 @@ void acpi_ec_remove_query_handler(struct acpi_ec *ec, u8 query_bit)

EXPORT_SYMBOL_GPL(acpi_ec_remove_query_handler);

-static void acpi_ec_gpe_query(void *ec_cxt)
+static void acpi_ec_gpe_run_handler(struct acpi_ec *ec, u8 value)
{
- struct acpi_ec *ec = ec_cxt;
- u8 value = 0;
struct acpi_ec_query_handler *handler, copy;

- if (!ec || acpi_ec_query(ec, &value))
- return;
mutex_lock(&ec->lock);
list_for_each_entry(handler, &ec->list, node) {
if (value == handler->query_bit) {
@@ -484,6 +480,18 @@ static void acpi_ec_gpe_query(void *ec_cxt)
mutex_unlock(&ec->lock);
}

+static void acpi_ec_gpe_query(void *ec_cxt)
+{
+ struct acpi_ec *ec = ec_cxt;
+ u8 value = 0;
+
+ if (!ec)
+ return;
+
+ while (!acpi_ec_query(ec, &value))
+ acpi_ec_gpe_run_handler(ec, value);
+}
+
static u32 acpi_ec_gpe_handler(void *data)
{
acpi_status status = AE_OK;

--

To: <alan-jenkins@...>
Cc: <linux-acpi@...>, linux-kernel <linux-kernel@...>
Date: Thursday, July 17, 2008 - 10:35 am

Hi Alan,

Could you please test if your patch works with the last patch in #10919?

Thanks,
Alex.

--

To: Alexey Starikovskiy <astarikovskiy@...>
Cc: <linux-acpi@...>, linux-kernel <linux-kernel@...>
Date: Thursday, July 17, 2008 - 12:02 pm

Vacuously so.

My patch still applies, but #10919 makes it obsolete. My patch fixed a
bug that shows up in polling mode. #10919 kills polling mode.

I've tested v2.6.26 + #10919 and it works fine (except for spamming the
kernel log - please read my Bugzilla comment).

It appears that interrupt mode suffered from a race which is very
similar to my original problem. If two GPE interrupts arrive before the
workqueue runs, then the second interrupt will be ignored because
EC_FLAGS_QUERY_PENDING is still set. This will happen with any EC if
interrupts are very close together, right?

I think my patch also fixes this theoretical problem. But I'd rather
you took over on this. I was already confused by ec.c in v2.6.26, and
with #10919 I understand it even less. E.g. why is
ec_switch_to_poll_mode() still present; what does it do now do_ec_poll()
is removed?

I'm happy to work on this with you, but I'd need to be able understand
the code first :-(.

Alan
--

To: Alan Jenkins <alan-jenkins@...>
Cc: <linux-acpi@...>, linux-kernel <linux-kernel@...>
Date: Thursday, July 17, 2008 - 12:45 pm

Not so, there are two polls in ec.c, first is poll for change in status register,
which gave the name to the mode, and still exists; the other is for event
in embedded controller, which was introduced to properly solve #9998, and part of
The notion of queue in embedded controller is new, it was never mentioned in
ACPI spec, so the driver was written with assumption that new query interrupt should
not arrive before we service previous one. There is even a chart in how interrupts
I think, this is not a theoretical problem, but the problem we've tried to solve in
See above, I still disable EC GPE for the time than we have pending query,

Thanks again,
Alex.
--

To: Alexey Starikovskiy <astarikovskiy@...>
Cc: <linux-acpi@...>, linux-kernel <linux-kernel@...>, Henrique de Moraes Holschuh <hmh@...>
Date: Thursday, July 17, 2008 - 2:55 pm

OK, I'm happy now.

However, I'm now worried that I broke the semantics for
EC_FLAGS_QUERY_PENDING. In my patch it gets cleared after the first
query, even though I'm running queries in a loop until nothing is left.
It doesn't cause a problem in my patch, but it's unclean and might cause
confusion later on. It'd be better to clear it after the loop has
completed.

Can I fix my patch? If you ACK the new code below, then I'll resend
with a proper changelog, S-o-B, CC's from the -mm mail (including
stable@kernel.org) and grovel to akpm, etc.

You're latest (quieter) work still applies on top and works fine.

Thanks
Alan

---

From: Alan Jenkins <alan-jenkins@tuffmail.co.uk>
Tested-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk>

diff --git a/drivers/acpi/ec.c b/drivers/acpi/ec.c
index 5622aee..2a42392 100644
--- a/drivers/acpi/ec.c
+++ b/drivers/acpi/ec.c
@@ -230,8 +230,7 @@ static int acpi_ec_transaction_unlocked(struct acpi_ec *ec, u8 command,
"finish-write timeout, command = %d\n", command);
goto end;
}
- } else if (command == ACPI_EC_COMMAND_QUERY)
- clear_bit(EC_FLAGS_QUERY_PENDING, &ec->flags);
+ }

for (; rdata_len > 0; --rdata_len) {
result = acpi_ec_wait(ec, ACPI_EC_EVENT_OBF_1, force_poll);
@@ -459,14 +458,10 @@ void acpi_ec_remove_query_handler(struct acpi_ec *ec, u8 query_bit)

EXPORT_SYMBOL_GPL(acpi_ec_remove_query_handler);

-static void acpi_ec_gpe_query(void *ec_cxt)
+static void acpi_ec_gpe_run_handler(struct acpi_ec *ec, u8 value)
{
- struct acpi_ec *ec = ec_cxt;
- u8 value = 0;
struct acpi_ec_query_handler *handler, copy;

- if (!ec || acpi_ec_query(ec, &value))
- return;
mutex_lock(&ec->lock);
list_for_each_entry(handler, &ec->list, node) {
if (value == handler->query_bit) {
@@ -484,6 +479,20 @@ static void acpi_ec_gpe_query(void *ec_cxt)
mutex_unlock(&ec->lock);
}

+static void acpi_ec_gpe_query(void *ec_cxt)
+{
+ struct acpi_ec *ec = ec_cx...

To: Alan Jenkins <alan-jenkins@...>
Cc: <linux-acpi@...>, linux-kernel <linux-kernel@...>, Henrique de Moraes Holschuh <hmh@...>
Date: Thursday, July 17, 2008 - 2:59 pm

Thanks,

--

To: Alexey Starikovskiy <astarikovskiy@...>
Cc: <alan-jenkins@...>, <linux-acpi@...>, <linux-kernel@...>, <hmh@...>, Maximilian Engelhardt <maxi@...>
Date: Tuesday, August 12, 2008 - 7:28 pm

Did this get fixed yet?

I have an patch in -mm which I just restored (I had to tempdrop it
because the acpi tree was busted for some time). But it seems to be
old.

http://bugzilla.kernel.org/show_bug.cgi?id=10919 is marked "resolved"
but the reporter (Maximilian) seems to think otherwise. 2.6.26.x is,
afaik, still unfixed, as is 2.6.27-rc.

Thanks.
--

To: Andrew Morton <akpm@...>
Cc: Alexey Starikovskiy <astarikovskiy@...>, <linux-acpi@...>, <linux-kernel@...>, <hmh@...>, Maximilian Engelhardt <maxi@...>, <linux-stable@...>
Date: Wednesday, August 13, 2008 - 6:21 am

That's correct. I think this specific patch should go in 2.6.27 and
2.6.26-stable. No objections have been raised so far.

I still need this patch to make my brightness and volume control keys
usable in 2.6.27-rc3. (They auto-repeat fast enough to trigger the
bug). This is true even after applying the latest patches from bug
10919 (#25 + #27).

I think the 10919 fix makes it harder to reproduce, but it definitely
still happens. I guess this is because the polling-driven EC
transactions add 1ms delays between each byte. The slower timings leave
a window where the buggy behaviour of my EC can make a difference. (It
has been seen to clear the "pending event" bit after a single event is
read, despite having more events pending).

There are more serious consequences of this bug. After a while it can
confuse the EC enough to cause lockups or reboots during boot, or after
pressing a single hotkey. This bad state is preserved over reboots,
even into known good kernels. Fortunately the badness clears when power
is removed for a long enough period. For a while I was worried that
something had physically burnt out.

Thanks
Alan
--

To: Alan Jenkins <alan-jenkins@...>
Cc: Alexey Starikovskiy <astarikovskiy@...>, <linux-acpi@...>, <linux-kernel@...>, <hmh@...>, Maximilian Engelhardt <maxi@...>, <linux-stable@...>
Date: Wednesday, August 13, 2008 - 6:46 am

Oh gad. And there's no workaround?
--

To: Andrew Morton <akpm@...>
Cc: Alexey Starikovskiy <astarikovskiy@...>, <linux-acpi@...>, <linux-kernel@...>, <hmh@...>, Maximilian Engelhardt <maxi@...>, <stable@...>
Date: Wednesday, August 13, 2008 - 7:51 am

[Dupe apology: CC'd to stable@kernel.org, with the right address this time]

Sorry, that was confusing.

The patch in currently in -mm _is_ the workaround for this damage. It
was not initially obvious just how important it was :-). I've
re-attached it as requested.

10919, "laggy hotkeys" is just what it says; ACPI EC events are slower
because of polling. It appears to be a more cosmetic issue which is
orthogonal to the _dropping_ of events.

Thanks
Alan

To: Alan Jenkins <alan-jenkins@...>
Cc: Andrew Morton <akpm@...>, Alexey Starikovskiy <astarikovskiy@...>, <linux-acpi@...>, <linux-kernel@...>, <hmh@...>, <stable@...>
Date: Wednesday, August 13, 2008 - 9:36 am

This patch doesn't fix my problem (bug 10919), it only changes it a bit.
When I press the dimming key and hold it pressed the display should dim=20
up/down step by step as long as I hold the key pressed (that's how it has=20
always been).
I'm now running a 2.6.27-rc3 kernel with your patch applied and the display=
=20
does dim as described below.
When I press the dimming key first the display brightness doesn't change at=
=20
all, then it jumps multiple steps, stays there for a short while and again=
=20
jumps some steps till it reaches the end of the dimming interval.
It looks like the key presses (assuming holding down the key does generate=
=20
multiple key presses) get queued up, then all processed all at once, then=20
again queued up...

Maxi

To: Maximilian Engelhardt <maxi@...>
Cc: Andrew Morton <akpm@...>, Alexey Starikovskiy <astarikovskiy@...>, <linux-acpi@...>, <linux-kernel@...>, <hmh@...>, <stable@...>
Date: Wednesday, August 13, 2008 - 10:39 am

This is expected behaviour - I can explain it in detail if that helps.
I see the same on my laptop, though perhaps it is more annoying on yours.

The patch I've submitted here is not intended to fix #10919. You said
the patch at <http://bugzilla.kernel.org/show_bug.cgi?id=10919#c21>
worked for you, and I think the approach there is the way forward for
this "laggy" issue. However I don't know if it will be ready for this
kernel release. I pointed out a cosmetic issue with it to Alexey but
haven't heard anything since.

Alan
--

To: Andrew Morton <akpm@...>
Cc: Alexey Starikovskiy <astarikovskiy@...>, <linux-acpi@...>, <linux-kernel@...>, <hmh@...>, Maximilian Engelhardt <maxi@...>, <linux-stable@...>
Date: Wednesday, August 13, 2008 - 7:45 am

Sorry, that was confusing.

The patch in currently in -mm _is_ the workaround for this damage. It
was not initially obvious just how important it was :-). I've
re-attached it as requested.

10919, "laggy hotkeys" is just what it says; ACPI EC events are slower
because of polling. It appears to be a more cosmetic issue which is
orthogonal to the _dropping_ of events.

Thanks
Alan

To: <alan-jenkins@...>
Cc: <linux-acpi@...>, <linux-kernel@...>
Date: Thursday, July 17, 2008 - 7:49 am

Hi Alan,

Thanks for the patch, ACK.

Regards,
Alex.

--

To: Alexey Starikovskiy <astarikovskiy@...>
Cc: <alan-jenkins@...>, <linux-acpi@...>, <linux-kernel@...>
Date: Thursday, July 17, 2008 - 8:13 am

This one fixes a potentially bad problem, since we could ignore more than

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
--

To: Henrique de Moraes Holschuh <hmh@...>
Cc: <alan-jenkins@...>, <linux-acpi@...>, <linux-kernel@...>
Date: Thursday, July 17, 2008 - 8:30 am

I vote for it

Regards,
Alex.
--

To: Alexey Starikovskiy <astarikovskiy@...>
Cc: <alan-jenkins@...>, <linux-acpi@...>, <linux-kernel@...>
Date: Thursday, July 17, 2008 - 12:26 pm

Well, in that case, it would be best to tack a Cc: stable@kernel.org git
footer to it right away, I think.

IMHO, it would also be nice if the commit message made it more clear that
the issue it solves can affect much more serious ACPI events than just hot
key presses.

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
--

To: Henrique de Moraes Holschuh <hmh@...>
Cc: Alexey Starikovskiy <astarikovskiy@...>, <linux-acpi@...>, <linux-kernel@...>
Date: Thursday, July 17, 2008 - 12:45 pm

Actually Alexey has another patch in bugzilla (#10919) which resolves
this issue in a better way. It avoids polling altogether, which is good
because it means you get events immediately. My laptop has backlight
adjustment hotkeys with hardware autorepeat, so it looks really jerky
with polling.

So I think I should withdraw my patch and leave this to Alexey. I've
tested his fix on my laptop and it works fine. It needs some more work
though - e.g. at the moment it spams the kernel log.

Thanks,
Alan
--

To: Alan Jenkins <alan-jenkins@...>
Cc: Alexey Starikovskiy <astarikovskiy@...>, <linux-acpi@...>, <linux-kernel@...>
Date: Thursday, July 17, 2008 - 2:50 pm

We need to get ALL pending events from the EC, whether in polling mode or in
interrupt mode (and we must make sure that we ARE going to queue any new
interrupt before we stop checking for an empty pending event queue,
otherwise there is a small window where a new event might get queued, and we
are still masking the GPE and don't notice it). As long as the fix does
that, it is all I care :-)

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh
--

To: Henrique de Moraes Holschuh <hmh@...>
Cc: Alexey Starikovskiy <astarikovskiy@...>, <linux-acpi@...>, <linux-kernel@...>
Date: Thursday, July 17, 2008 - 3:07 pm

OK, I forgot about that. I need to do some reading on kernel
programming, workqueues in particular, and redo my patch again. Sorry,
Alexey! AFAICS there's still this (very) small window where events could
be dropped.

Alan
--

To: Alexey Starikovskiy <astarikovskiy@...>
Cc: Henrique de Moraes Holschuh <hmh@...>, <linux-acpi@...>, <linux-kernel@...>
Date: Saturday, July 19, 2008 - 7:37 am

Here's what I came up with -

1. I was fighting against EC_FLAGS_QUERY_PENDING. This was used to ignore
multiple successive GPE interrupts and treat them as a single GPE instead.
That's the exact opposite of what we want to do. Let's get rid of it.

2. Then we can apply my original patch to fix GPE polling on the Asus EeePC,
by repeatedly querying for GPEs until there are none left.

3. Finally, if I'm right then we now know how to handle "GPE interrupt storms".
Some EC's are raising multiple interrupts before we acknowledge them. but
they're just telling us how many events are pending. There's no harm in
that, so we don't ever need to disable GPE interrupts. Let's get rid of
GPE polling mode. (Code mainly stolen from Alexey).

Patch 3 would benefit from wider testing. Fortunately there are several open
bugs about GPEs so it should be easy to find testers :-).

Alan

--

To: Alan Jenkins <alan-jenkins@...>
Cc: Alexey Starikovskiy <astarikovskiy@...>, Henrique de Moraes Holschuh <hmh@...>, <linux-acpi@...>, <linux-kernel@...>
Date: Saturday, July 19, 2008 - 10:07 am

On Sat, Jul 19, 2008 at 1:37 PM, Alan Jenkins

Hi,

I have seen the "GPE storm" message before but it had no apparent
side-effects. Your patches do not seem to change this, although the
message is now gone. Thanks! (I have an Acer Aspire 5720 laptop.)

Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--

Previous thread: [PATCH 2/7] dynamic debug v2 - nfs conversion by Jason Baron on Tuesday, July 15, 2008 - 5:32 pm. (9 messages)

Next thread: [patch 00/17] Tracepoints v4 for linux-next by Mathieu Desnoyers on Tuesday, July 15, 2008 - 6:26 pm. (3 messages)