Work unit pegged at 100% CPU, 64 bit Debian

Mike
Joined: 14 Jul 12
Posts: 6
Credit: 5,383
RAC: 0
Message 1020 - Posted: 21 Apr 2013, 21:14:40 UTC

Application: ICT Protein Structure Prediction(2nd Generation) 1.21
Workunit name: ICT_5658_12
Received: Sun 21 Apr 2013 17:26:51 BST
Deadline: Wed 24 Apr 2013 17:35:05 BST



Name: ICT_5658_12
Application: ICT Protein Structure Prediction(2nd Generation)
Created: 21 Apr 2013, 16:34:36 UTC
Minimum quorum: 1
Initial replication: 1
Max # of error/total/success tasks: 8, 12, 4

http://casathome.ihep.ac.cn/workunit.php?wuid=9490502


I've tried suspending the work unit, but it's still running at 100% on one of my cores. This is a 64-bit Debian Linux box running Testing.

% boinc --version
7.0.36 x86_64-pc-linux-gnu
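
A quick way to confirm whether a suspended task is really stopped on Linux is to read its state from /proc/<pid>/stat: R means running, S sleeping, T stopped by a signal. A minimal Python sketch, assuming a Linux /proc filesystem; the function names are made up for illustration:

```python
def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')' characters, so split at the LAST ')' rather than on whitespace.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def process_state(pid):
    """Read the current state of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())
```

If BOINC shows the task as suspended but process_state() keeps returning R for its PID, the suspend request probably did not take effect for that process.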

wenjing wu
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Joined: 13 Sep 10
Posts: 161
Credit: 751,216
RAC: 0
Message 1023 - Posted: 23 Apr 2013, 9:46:38 UTC - in response to Message 1020.

It appears to have already been aborted by the user.
http://casathome.ihep.ac.cn/result.php?resultid=20333842

I think there is a delay between aborting a task from the BOINC GUI and the actual application being killed on the host.
____________
加油!CAS@home!我们帮助科学家跟时间赛跑!
Go CAS@home! We help scientists to race against time!

Mike
Message 1025 - Posted: 23 Apr 2013, 21:10:27 UTC - in response to Message 1023.
Last modified: 23 Apr 2013, 21:11:07 UTC

I had to kill it by process ID in the end to stop it - it was cooking my computer :-)

I'll let you know if I come across any more problematic ones - please tell me if there is any more information I can provide to help debug.

Thanks,
Mike.

wenjing wu
Message 1031 - Posted: 25 Apr 2013, 6:44:57 UTC - in response to Message 1025.

Thanks for the information.
I will see if I can reproduce it on my Linux machine!



Mike
Message 1034 - Posted: 27 Apr 2013, 10:26:16 UTC
Last modified: 27 Apr 2013, 11:19:31 UTC

Hello, I have another one I'm afraid...

Workunit name: ICT_6844_26

Received: Sat 27 Apr 2013 06:37:20 BST
Deadline: Sun 28 Apr 2013 18:45:53 BST

Same behaviour as before: one core runs at 100% even when the task is suspended.

I'll try detaching from the project, then reattaching.

Mike
Message 1037 - Posted: 28 Apr 2013, 2:27:18 UTC - in response to Message 1034.
Last modified: 28 Apr 2013, 2:27:56 UTC

I detached the machine from the project, and reattached. Three more tasks downloaded, and each used 100% of a core.

I've suspended the project and killed the tasks.

What should I do?

wenjing wu
Message 1075 - Posted: 6 May 2013, 2:41:21 UTC - in response to Message 1037.

This used to be a known bug in the BOINC wrapper, which controls the running of the application; we fixed it a long time ago. I tried "suspend" and "resume" on a Red Hat Linux machine, and they work well there. My suggestion is to try the most recent BOINC client and see how it behaves.
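
For reference, the usual way for a supervising process to suspend and resume a child on Linux is to send it SIGSTOP and SIGCONT. The Python sketch below demonstrates that mechanism on a throwaway busy-loop child; it assumes the wrapper does something similar and is not taken from the wrapper's code:

```python
import os
import signal
import subprocess
import sys


def demo_stop_and_continue():
    """Spawn a CPU-bound child, pause it with SIGSTOP, resume it with SIGCONT."""
    # Hypothetical stand-in for a science application: a busy loop.
    child = subprocess.Popen([sys.executable, "-c", "while True: pass"])
    try:
        os.kill(child.pid, signal.SIGSTOP)
        # WUNTRACED makes waitpid report a stopped (not only an exited) child.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)

        os.kill(child.pid, signal.SIGCONT)
        # WCONTINUED makes waitpid report that the child was resumed.
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        continued = os.WIFCONTINUED(status)
    finally:
        child.kill()  # SIGKILL cannot be caught or ignored
        child.wait()  # reap the child so no zombie is left behind
    return stopped, continued


if __name__ == "__main__":
    print(demo_stop_and_continue())
```

Note that the signal only reaches the process it is sent to. If the supervised program starts further processes of its own, those would keep running unless they are signalled as well.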



wenjing wu
Message 1076 - Posted: 6 May 2013, 2:46:15 UTC - in response to Message 1037.

Another thought: maybe your host downloaded more jobs than it has CPU cores, so when you suspend one job, another waiting job kicks in, uses the same core, and keeps it occupied.
On the server side, we allow clients to download more jobs than they have CPU cores, because many users prefer to download extra jobs and run them while their hosts are offline.
If you want to free the CPU cores, you can either specify a lower CPU percentage for BOINC to use in the GUI, or suspend the whole project.
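
For what it's worth, BOINC has two separate CPU limits in its computing preferences: one caps how many processors it may use, the other caps the share of CPU time. The sketch below (Python) illustrates the first one only, under the assumption that the client rounds the allowed core count down and never goes below one; the function name and the rounding rule are illustrative, not taken from the BOINC source:

```python
def usable_cores(total_cores, max_processors_pct):
    """Estimate how many cores BOINC may use under a processor-percentage cap.

    Assumption: the result is rounded down, with a minimum of one core.
    """
    cores = int(total_cores * max_processors_pct / 100)
    return max(1, cores)


if __name__ == "__main__":
    for total, pct in [(4, 40), (8, 50), (2, 40)]:
        print(total, "cores at", pct, "% ->", usable_cores(total, pct))
```

Under that assumption, a 40% limit on a hypothetical four-core host would leave a single core for BOINC, so a waiting job should only take over a core that the suspended job has released.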


Mike
Message 1080 - Posted: 6 May 2013, 13:08:45 UTC
Last modified: 6 May 2013, 13:15:18 UTC

I have max CPU set to 40%, and these were definitely individual processes stuck at 100% on one core, not responding to anything other than being killed as root.

I've deleted the directory /var/lib/boinc-client/projects/casathome.ihep.ac.cn and reattached to the project - I'll see if that helps.

Fortunately, Debian Wheezy has just been released, and new versions of things (including BOINC) have just arrived for Testing users.

I'm hoping I don't need to upgrade BOINC beyond Debian's main repository - obviously that keeps things simple.

Thank you for the suggestions.

Mike
Message 1084 - Posted: 6 May 2013, 17:41:32 UTC - in response to Message 1080.

Well, I've upgraded, detached from the project, deleted the directory in /var, and reattached.

% boinc --version
7.0.65 x86_64-pc-linux-gnu

The symptoms are better, but not fixed - the task runs at nearly 100%, but seems to stop running for a second every 5 seconds or so.
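
To put a number on "nearly 100%", the CPU time a process has used can be sampled twice and divided by the elapsed time. On Linux the counters are utime and stime (fields 14 and 15 of /proc/<pid>/stat), measured in clock ticks. A rough Python sketch, assuming 100 ticks per second (os.sysconf("SC_CLK_TCK") reports the real value) and made-up sample numbers:

```python
def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s=100):
    """CPU usage of one process over a sampling interval, as a percentage.

    ticks_before / ticks_after are utime + stime read at the start and end
    of the interval; ticks_per_s is an assumed default, check SC_CLK_TCK.
    """
    used_s = (ticks_after - ticks_before) / ticks_per_s
    return 100.0 * used_s / interval_s


if __name__ == "__main__":
    # Hypothetical samples taken five seconds apart.
    print(cpu_percent(1000, 1400, 5.0))
```

A task that really pauses for about one second in every five would come out at roughly 80% by this measure.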

Nevertheless, it's causing my system to run too hot. I'm happy to help debug if there's anything I can do, but for now I'm afraid I'll have to keep it suspended.

