cas is suspended until problem is fixed

Message boards : Number crunching : cas is suspended until problem is fixed

DskRbt
Joined: 23 Jul 10
Posts: 2
Credit: 7,602
RAC: 0
Message 154 - Posted: 19 Aug 2010, 4:32:10 UTC
Last modified: 19 Aug 2010, 4:34:33 UTC

http://casathome.ihep.ac.cn/forum_thread.php?id=23

If you have a question or problem, please use the Questions & Answers section of the message boards. (Why does no one look there?)

One World, One Dream
Joined: 2 Jul 10
Posts: 24
Credit: 84,500
RAC: 0
Message 155 - Posted: 19 Aug 2010, 10:46:53 UTC

Hello DskRbt, it is normal for some tasks to be aborted by the project.
Your BOINC Manager says "cancelled by project / abgebrochen durch Projekt / 已被项目终止" if the task you downloaded has already been finished by others and your result is no longer needed.
Many projects do this to prevent redundant results. When the project communicates with your BOINC Manager and sees that your task has not started yet and that its result is no longer needed, the task automatically jumps to 100% and a new work unit is downloaded.
No CPU time will be used on that cancelled task, so you do not need to worry about wasted computation time.
This behavior happens more often with cas@home than with other projects because your BOINC client connects to the server once every minute, while other projects use longer intervals between connections.
I hope my explanation answers your question.

Profile Sorceress
Joined: 25 Jul 10
Posts: 40
Credit: 2,816
RAC: 0
Message 156 - Posted: 19 Aug 2010, 13:19:19 UTC - in response to Message 155.
Last modified: 19 Aug 2010, 14:07:04 UTC

Your Boinc manager says "cancelled by project" if the task you downloaded has already been finished by others and your result is no longer needed.
Many projects do this to prevent redundant results.


No, *MANY* do not! I am in 30 top projects, and none of them have this problem. Yes, occasionally I have seen the *aborted by project* message but it's very rare. Nor do I know if there had been any work started on the WU before it was aborted. I wasn't watching.

When the project communicates with your Boinc manager and sees that your task has not started yet and its result is not needed anymore, it will automatically jump to 100% and download a new work unit.


But! What if the WU has been started??

No CPU time will be used on that cancelled task, so you do not need to worry about wasted computation time.

This behavior happens more often with cas@home than with other projects because your Boinc client connects with the server once every minute, while other projects have longer time spans to connect with the server.
I hope my explanation answers your question.


What is the reason BOINC communicates with CAS@H more often than with the other projects? Even if BOINC communicates with CAS@H every minute, given the short processing time for CAS@H WUs, processing is well underway on all the WUs. So if the WU has begun processing and the project checks on its status, what happens then? CAS@H is sending out 3 replicas of a WU. Then it must wait for two of them to be validated. The time between a WU being sent and being validated is more than enough for all 3 to be completed. Why would CAS@H abort any WU before it has been validated? That would defeat the purpose of redundancy.

I'm sorry, but your explanations simply don't hold water. Because I was aware there may be problems with CAS@H WUs, I WATCHED the whole process completely. From start to finish. The WU was received, processing was started as normal, and I watched it through to completion. It took about 38 minutes, and at no time was the WU aborted (as would happen in your scenario) until I uploaded it. Nor was I paid for the WU! So *DO NOT* tell me the WUs are aborted *BEFORE* they have started. It's not true! Nor was this just one incident, but one of many. So while your explanation may work for some, it's not true in my case. And I KNOW this is happening to others.

My point in this whole matter is not the redundancy but the credits for work done. It *IS* being done and *NOT* being credited!! I listed 5 WUs earlier in 'Granted Credits' that I observed personally. Those alone contradict your explanations. What aggravates me is the fact that out of 20 WUs I *COMPLETED*, 9 of them were aborted. That's almost half! Because of this and other issues I have, this project is beginning to be a waste of my time. I do like CAS@H and the science behind it, but either these issues are resolved or I move on. There is no reason for us to have to deal with all these aborted WUs. The issue of redundancy can be handled in much easier ways, but ADMIN refuses to implement them.

What I do question is where you get your facts. Are you part of the CAS@H admin or just a volunteer? Or is there a page where I can find them? Since the 'facts of life' don't support your 'theories of evolution', why don't you do some real research? You could begin with reading earlier posts in the other forums.

One World, One Dream
Joined: 2 Jul 10
Posts: 24
Credit: 84,500
RAC: 0
Message 158 - Posted: 19 Aug 2010, 13:46:10 UTC

I see. I am sorry, I thought my explanations were correct. At least that is how the project has worked for me so far. I still have not encountered any work units that were cancelled after they were finished, but I believe you that this indeed happened with some people. Hopefully Mr. Wu Jie can solve the problem soon.

Profile Sorceress
Joined: 25 Jul 10
Posts: 40
Credit: 2,816
RAC: 0
Message 159 - Posted: 19 Aug 2010, 14:16:48 UTC - in response to Message 158.

Thanks for your input. I hope he will. As I said, there are much better ways to handle redundancy than aborting all these work units. He just needs to implement them.

BTW, I am retired and have a lot of time to watch most of what goes on with my projects. Maybe this will help you understand how I came to my conclusions.

Regards

Profile Sorceress
Joined: 25 Jul 10
Posts: 40
Credit: 2,816
RAC: 0
Message 160 - Posted: 19 Aug 2010, 14:25:56 UTC - in response to Message 154.

http://casathome.ihep.ac.cn/forum_thread.php?id=23

If you have a question or problem, please use the Questions & Answers section of the message boards. (Why does no one look there?)


Probably because Q&A rarely answers any questions!! :> Most likely it's 'cause most of the WU processing issues are handled in the 'Number Crunching' forums. That's where people are reading and replying to posts.

Profile Jie Wu
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Joined: 14 Mar 10
Posts: 60
Credit: 3,474
RAC: 0
Message 165 - Posted: 20 Aug 2010, 14:12:43 UTC

The issue of redundancy can be handled in much easier ways, but ADMIN refuses to implement them.


Firstly, tell me how! No philosophy, step by step!

Secondly, tell me when I said "refuse", or how you reached this conclusion?

Profile Jie Wu
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Joined: 14 Mar 10
Posts: 60
Credit: 3,474
RAC: 0
Message 166 - Posted: 20 Aug 2010, 14:12:46 UTC

The issue of redundancy can be handled in much easier ways, but ADMIN refuses to implement them.


Firstly, tell me how! No philosophy, step by step!

Secondly, tell me when I said "refuse", or how you reached this conclusion?

Profile Sorceress
Joined: 25 Jul 10
Posts: 40
Credit: 2,816
RAC: 0
Message 169 - Posted: 20 Aug 2010, 16:59:18 UTC - in response to Message 166.
Last modified: 20 Aug 2010, 17:07:22 UTC

The issue of redundancy can be handled in much easier ways, but ADMIN refuses to implement them.


Firstly, tell me how! No philosophy, step by step!

Secondly, tell me when I said "refuse", or how you reached this conclusion?


Jie Wu, I did not say you 'refused' at all. Sorry for my lack of consideration. I meant 'refused' as in not 'acting' to resolve the problem. I apologize for the misunderstanding.

I feel that requiring 3 replicas to be sent out at a time invites trouble. There are too many different systems out here to use such an inefficient method for redundancy. Although I have no statistics to show how many WUs actually fail to process successfully, in my own personal experience my system rarely fails. If that is the case with most hosts, then the chance of failing to get the necessary validations of a WU is fairly low.

So here's what I propose...

Try sending only 2 replicas out first, instead of your usual 3. Then see how many times you get validation on the first try. If, the majority of the time, you get the 2 you need for quorum on the first sending, stop sending the 3rd. Use a third sending ONLY as needed for a valid quorum. If you look at most of the other projects' requirements, you see 2 for quorum and 2 for replica. Rarely are 3 replicas sent at first. (This only applies to the projects I am attached to.)
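
As a rough illustration of the proposal above, the sketch below compares an initial replication of 3 with an initial replication of 2 for a quorum of 2. It is a toy model, not BOINC code: the function names and the parameter `p` (the chance that any single result comes back valid, assumed independent across hosts) are assumptions made for this example, and top-up copies are assumed to be sent one at a time until the quorum is met.

```python
from math import comb


def expected_sent(initial: int, quorum: int, p: float) -> float:
    """Expected number of results sent before `quorum` valid ones exist.

    Hypothetical model: each result is valid with independent probability p;
    after the initial batch, extra copies are sent one at a time.
    """
    total = 0.0
    for k in range(initial + 1):  # k = valid results in the initial batch
        prob_k = comb(initial, k) * p**k * (1 - p) ** (initial - k)
        missing = max(quorum - k, 0)
        # Each missing valid result costs 1/p extra sends on average.
        total += prob_k * (initial + missing / p)
    return total


def expected_surplus(initial: int, quorum: int, p: float) -> float:
    """Expected number of valid results beyond the quorum in the initial batch."""
    return sum(
        comb(initial, k) * p**k * (1 - p) ** (initial - k) * max(k - quorum, 0)
        for k in range(initial + 1)
    )


for initial in (2, 3):
    sent = expected_sent(initial, quorum=2, p=0.95)
    surplus = expected_surplus(initial, quorum=2, p=0.95)
    print(f"initial={initial}: sent={sent:.3f}, surplus valid={surplus:.3f}")
```

In this model an initial replication of 2 never produces a surplus valid result, while an initial replication of 3 produces one whenever all three hosts succeed. The price of the smaller batch is the occasional wait for a top-up copy, which is the trade-off the rest of the thread argues about.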

By doing it this way, you reduce or eliminate the excessive number of aborted WUs. When I look at my task view and see all those aborted WUs, I get aggravated, believing, as I do, that those WUs had been completed and not granted any credit. And I'm sure other users are seeing and feeling the same way. Many probably just say 'screw it' and detach, rather than trying to fix the problems. I prefer to stay and fix them.

Try to see our point of view. When I look at the task view of any of my other projects, I do not see a long list of aborted WUs. So, obviously, the other projects are handling redundancy in a different way. Their stats prove that. I am asking for your consideration in helping reduce the number of aborted WUs and the frustrations that may accompany them.

I mean no disrespect whatsoever, Jie Wu. Again, I apologize for any misunderstandings.

Regards,
Sorceress

Profile Sorceress
Joined: 25 Jul 10
Posts: 40
Credit: 2,816
RAC: 0
Message 170 - Posted: 20 Aug 2010, 17:43:36 UTC

I have checked all 30 of my current projects (you can view my project list) and NONE require 3 for 'initial replica'. NONE. CAS@H is the only one. It's either 1 or 2 from what I see. Requiring 3 initial replicas is the cause of the large number of aborted WUs and wasted redundancy. I feel this is unnecessary aggravation and could easily be avoided with an initial replication of 2 or less. IMHO, that is the best solution.

Profile Jie Wu
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Joined: 14 Mar 10
Posts: 60
Credit: 3,474
RAC: 0
Message 172 - Posted: 22 Aug 2010, 16:00:05 UTC - in response to Message 170.
Last modified: 22 Aug 2010, 16:04:18 UTC

Please look at http://www.boinc-wiki.info/Min_quorum. Apparently, in some projects min_quorum is equal to target_nresults, while in others it is not. Actually, it is common for projects to set target_nresults higher than min_quorum, because their scientists usually want to get results quickly. Suppose min_quorum is equal to target_nresults and a volunteer gets a WU, but then has to shut down his computer before the WU is finished because of a month-long business trip. What will happen? The project server has to wait until the WU exceeds its deadline (the BOINC default value is 7 days) and then send another copy of the WU to another computer. That is not acceptable to scientists who want results quickly.

The above is the reason why I set the values like that.
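
The latency argument above can be sketched as a small calculation. This is only an illustration under stated assumptions, not BOINC's scheduler logic: the function name, the `replacement_turnaround` parameter, and the idea that a replacement copy always returns are all hypothetical, and a host that never reports is represented by `None`.

```python
from typing import Optional, Sequence


def days_to_quorum(
    returns: Sequence[Optional[float]],
    quorum: int,
    deadline: float = 7.0,  # deadline in days, as quoted in the post above
    replacement_turnaround: float = 1.0,  # hypothetical: days a resent copy takes
) -> float:
    """Day on which `quorum` results are in hand.

    `returns` holds, for each initially sent copy, the number of days the
    host takes to report, or None if the host never reports.
    """
    finish_days = []
    for r in returns:
        if r is not None and r <= deadline:
            finish_days.append(r)
        else:
            # The server only notices a missing result at the deadline,
            # then sends a replacement copy to another host.
            finish_days.append(deadline + replacement_turnaround)
    finish_days.sort()
    return finish_days[quorum - 1]


# Two copies sent, quorum of 2, one host away on a trip.
print(days_to_quorum([1.0, None], quorum=2))
# Three copies sent, quorum of 2, same absent host.
print(days_to_quorum([1.0, None, 1.5], quorum=2))
```

With two copies the quorum waits for the deadline plus the replacement, whereas the spare third copy lets the quorum complete as soon as the second healthy host reports.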

Profile ChertseyAl
Joined: 17 Jun 10
Posts: 4
Credit: 50,920
RAC: 0
Message 174 - Posted: 22 Aug 2010, 17:51:37 UTC - in response to Message 172.

I've been following this issue for some time, as I too have stopped crunching it because of the aborts. I crunch, and have crunched, many projects. This is one of the worst in terms of 'ease of use' for the volunteer. And it's volunteers that you rely upon.

The answer is that the project server has to wait until the WU exceeds the deadline (BOINC default value is 7 days), and then send another copy of the WU to another computer. It is not acceptable to scientists who want results quickly.


Ok ...

1) Wasting volunteers' computer time, electricity costs etc. is unacceptable. Full stop. Just credit the redundant result if you insist on setting IR higher than MQ. Even LHC saw the error of their ways and abandoned this crazy scheme.

2) If you need the results sooner, set a tighter deadline. Try 5 days, or 3 days. Don't set it any lower, as your project will force every client into panic mode and annoy volunteers.

3) Have you considered that maybe BOINC isn't a suitable platform for this project? There are many projects that don't use it. BOINC is meant to be easy and friendly for volunteers to use. Tight deadlines and frequent aborted WUs make it neither.

4) I'm not an expert on BOINC, but I do run 10 PCs dedicated to it (and a couple of others in more general use), and in the height of Summer burning a kilowatt of electricity for no reward is not something that makes me happy ;)

Don't take any of the above personally - It's sometimes tricky to get a project running smoothly and I hope you reach a sensible compromise between your requirements and those of the volunteers :)

Al.

Profile Jie Wu
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Joined: 14 Mar 10
Posts: 60
Credit: 3,474
RAC: 0
Message 175 - Posted: 22 Aug 2010, 20:33:32 UTC - in response to Message 174.
Last modified: 22 Aug 2010, 20:55:58 UTC

1) Wasting volunteers' computer time, electricity costs etc. is unacceptable. Full stop. Just credit the redundant result if you insist on setting IR higher than MQ. Even LHC saw the error of their ways and abandoned this crazy scheme.


Firstly, our project didn't waste, isn't wasting and will not waste volunteers' computer time!!!!

Secondly, I think it is necessary to make the BOINC abort principle clear! Normally, BOINC will abort a redundant task that has not started calculating: the progress bar of the task jumps directly from 0 to 100%, and it does not consume any CPU time! Of course, the server does not grant you credit for it either. But if the task has started crunching, it will not be aborted! When it is finished, it will be reported to the server, and the server will validate it and decide whether it should be granted credit. I am sure our project server is handling workunits like this, and aborts turn up when the task has not started. So, when you run into an abort, the first thing you should do is not to suspend the project, but to check why your task was aborted!
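
The rule as stated in this post can be written down as a small decision function. This is a sketch of the rule described here, not BOINC source code; the names `Outcome` and `redundant_task_outcome` are made up for the example.

```python
from enum import Enum


class Outcome(Enum):
    KEEP_RUNNING = "result still needed, nothing changes"
    ABORTED_NO_CREDIT = "cancelled before start: no CPU time used, no credit"
    FINISH_THEN_VALIDATE = "runs to completion, is reported, validator decides credit"


def redundant_task_outcome(still_needed: bool, started: bool) -> Outcome:
    """Apply the abort rule exactly as the post describes it."""
    if still_needed:
        return Outcome.KEEP_RUNNING
    if not started:
        # Redundant and untouched: progress jumps from 0 to 100%.
        return Outcome.ABORTED_NO_CREDIT
    # Redundant but already crunching: it is not aborted.
    return Outcome.FINISH_THEN_VALIDATE


for needed in (True, False):
    for started in (False, True):
        outcome = redundant_task_outcome(needed, started)
        print(f"needed={needed}, started={started}: {outcome.name}")
```

The reports earlier in the thread describe tasks that ran to completion and were still marked as aborted, which is a case this rule says should not occur.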

Lastly, are you sure you know why we are talking about the abort problem? If you are not, please read all of this thread and another thread, http://casathome.ihep.ac.cn/forum_thread.php?id=20, and then go on.
As Sorceress said, it may be a "switch" problem. I think modifying the server configuration alone cannot solve the problem; it is better to report it to the BOINC development team!

2) If you need the results sooner, set a tighter deadline. Try 5 days, or 3 days. Don't set it any lower, as your project will force every client into panic mode and annoy volunteers.


I have considered a tighter deadline. To scientists, 5 days or 3 days is the same as 7 days. But it is almost impossible to make it shorter. Different tasks need different amounts of time, and different computers need different amounts of time to calculate the same task! So in the end, the deadline is necessarily much longer than the average run time. Trust me, if I set a tighter deadline, complaints will drown me!


3) Have you considered that maybe BOINC isn't a suitable platform for this project? There are many projects that don't use it. BOINC is meant to be easy and friendly for volunteers to use. Tight deadlines and frequent aborted WUs make it neither.


The SCThread application was confirmed after four volunteer computing experts and experts from some other fields discussed it at the CAS@home workshop on 9th March. It is definitely suited to volunteer computing!

By the way, I advise that we stop talking about this topic in this thread and move to this thread:
http://casathome.ihep.ac.cn/forum_thread.php?id=20

Profile Sorceress
Joined: 25 Jul 10
Posts: 40
Credit: 2,816
RAC: 0
Message 176 - Posted: 22 Aug 2010, 21:31:25 UTC - in response to Message 172.
Last modified: 22 Aug 2010, 22:04:11 UTC

Please look at http://www.boinc-wiki.info/Min_quorum. Apparently, in some projects min_quorum is equal to target_nresults, while in others it is not. Actually, it is common for projects to set target_nresults higher than min_quorum, because their scientists usually want to get results quickly. Suppose min_quorum is equal to target_nresults and a volunteer gets a WU, but then has to shut down his computer before the WU is finished because of a month-long business trip. What will happen? The project server has to wait until the WU exceeds its deadline (the BOINC default value is 7 days) and then send another copy of the WU to another computer. That is not acceptable to scientists who want results quickly.


I looked at the link you posted and almost all of the 'initial replica' values are incorrect. If you're basing your project's 'mode of operations' on Wiki, no wonder you are so far off target. Wiki is a great information source but it's completely unreliable for accuracy and rarely up to date. I crunch for five of the projects listed: SETI, Einstein, Simap, Climate Prediction, and Rosetta. The only 'initial replica' on that list that is correct is Climate Prediction. I can easily bet that the rest are inaccurate as well.

This issue is obviously going nowhere. I told you how to reduce the large number of aborted workunits you have: set the 'initial replica' to 2 or less, and send a third only if you need it to validate a min_quorum. You are wasting our time and resources with this nonsense, whether you think so or not. It's aggravating to see that many aborted workunits.

You have a choice. Stick by your 'guns' and continue to push the 3 'initial replicas' on us, with all those aborted WUs piling up. Or change it to 2 and see if that works. I'm tired of the aborts and the lack of consideration for the users. Coupled with the other issues I have with how this project works, it's time for me to drop it. I do hope you will see reason. Otherwise, I'm outta here.

It's your call!

Let me say one more thing. I don't think you understand the project/participant symbiosis. We attach our computers to projects for several reasons. Some for the science and the desire to help in the research. Some for the 'bragging rights' of having the most credits. And some for a little of both. But in all cases, we do this for fun. Distributed computing is all 'volunteer'. When a project is causing or having problems, it's no longer FUN. Get my point? With all the aborted WUs, your project is not fun and is a source of irritation. When that happens, you will start losing your support.

Profile Jie Wu
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Joined: 14 Mar 10
Posts: 60
Credit: 3,474
RAC: 0
Message 180 - Posted: 23 Aug 2010, 8:56:01 UTC


Everyone must be tired! Let's stop arguing about this problem! I have modified the server configuration. In several hours, you will receive tasks whose min_quorum and target_nresults are the same!

Profile Sorceress
Joined: 25 Jul 10
Posts: 40
Credit: 2,816
RAC: 0
Message 181 - Posted: 23 Aug 2010, 23:41:06 UTC - in response to Message 180.
Last modified: 24 Aug 2010, 0:02:36 UTC

Jie Wu, your assistance in finally resolving a very frustrating situation is much appreciated. The new WUs I received are now set appropriately to 2 'init_replicas'. I apologize for my hard-edged approach to resolving this problem, but it had to be fixed. It was never my intention to be disrespectful, in any event. I feel certain that operations will run much more smoothly now that it has, especially given the misunderstandings and aggravation all those aborted WUs were causing. We still have a few more areas to work on, like increasing the credit levels and making some changes to the website GUI, but for now those can wait upon your convenience. We will talk about those later. I am glad I don't have to detach from CAS@H. I do like the science you're doing and want to do my part to help. Thanks from all of us.

BTW, a word of advice. Wikipedia is really not a good place to get 'accurate' information of ANY kind. While Wiki does provide useful information about a lot of things, you cannot take everything it says 'verbatim'. Caution and a little common sense are the key to using Wiki. In our case, contact with the other projects' admins would have been much more productive and reliable. Wikipedia is like the Christian Bible. Written by men with good intentions.

Cheers

