Updates on Nanotechnology application

wenjing wu
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Joined: 13 Sep 10
Posts: 161
Credit: 751,216
RAC: 0
Message 818 - Posted: 25 Nov 2012, 13:43:30 UTC
Last modified: 18 Feb 2013, 7:00:37 UTC

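We recently found several bugs in the Lammps application:
1. Some results were not validated, because of a bug in the validator service.
2. For some jobs the estimated computing time and the deadline are inconsistent, so the job's running time exceeds its deadline.
3. Some volunteers reported that after stopping the lammps program from the BOINC GUI, the application does not actually exit.
Bug 1 has been fixed. Bugs 2 and 3 are still under investigation.
Thank you all for your support. I am travelling for work at the moment, so my replies are delayed; my apologies.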


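Update:
Regarding bug 2: some jobs do run for several hundred hours, but their deadlines were not set accordingly. We have adjusted this and now set the deadline to 8 times the actual running time. If you see a job whose deadline falls in 2013, congratulations: your host has received a Lammps job that needs about 800 hours to run. Please do not panic. Our jobs implement checkpointing, so even if you shut down your computer, the results computed so far will not be lost.
Also, given the many comments below about this deadline, some explanation is needed:
In the early days our Lammps users varied the job parameters (which directly affect the running time) over a small range, so a single job ran for at most 3 hours by default, and we gave every job a fixed deadline of 3 days. Later the users ran a new batch of simulations in which key parameters were drawn at random, so running times varied widely (from tens of minutes to over a thousand hours). The Lammps users and the CAS@home platform administrators did not communicate about this, so a large number of jobs still went out with the 3-day deadline. After discovering the problem we promptly modified the program to set a dynamic deadline based on each job's estimated running time. The estimates themselves are inevitably off at times. We are using the results returned so far to measure the estimation error and make further adjustments.
By default BOINC gives higher priority to jobs that are close to their deadline, so the earlier jobs with 3-day deadlines took a large share of priority. Until we can estimate running times accurately, we have set a very generous deadline, so that volunteers' computing time is not wasted and other projects are not treated unfairly.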


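Regarding bug 3:
We have fixed the bug in the program and upgraded the BOINC wrapper. Because of limitations in the Windows API, when you suspend a task in the BOINC GUI you will see the lammps processes disappear "slowly" from Task Manager (this takes about 5-20 seconds, depending on how many processes the host is running at the time). According to our tests, there should no longer be Lammps processes that cannot exit.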

We have recently found a few bugs in the Nanotechnology application (Lammps):
1. New results are not validated, due to a bug in the validator.
2. The deadline and the estimated computing time are not consistent, i.e. some jobs are estimated to run longer than the deadline.
3. Some volunteers reported that the lammps application does not terminate after jobs are suspended from the BOINC GUI.
I have fixed bug No. 1, but still need some investigation to fix bugs No. 2 and No. 3. Sorry for the inconvenience, and many thanks for your feedback.

Updates:
Bug 2 is fixed now: the deadline is set to 8 times the estimated running time of each job. Some of the Lammps jobs can last up to 800 hours, so if you are lucky enough to get one of these jobs on your host, you will see a deadline in July 2013. Do not panic about it! We have implemented checkpointing in the Lammps jobs, so even if you restart your computer you will not lose what has been finished, and the job will continue from where it stopped.
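
To illustrate the rule above, here is a minimal Python sketch of how a deadline of 8 times the estimated running time could be computed. It is not the CAS@home server code: `compute_deadline`, `DEADLINE_FACTOR`, and the send time are hypothetical names and values chosen for the example. Only the factor of 8 and the 3-hour and 800-hour job lengths come from this thread.

```python
from datetime import datetime, timedelta

# From the post: deadline = 8 x the estimated running time of the job.
DEADLINE_FACTOR = 8


def compute_deadline(sent_at, estimated_hours, factor=DEADLINE_FACTOR):
    """Return the deadline for a job sent at `sent_at` (hypothetical helper)."""
    if estimated_hours <= 0:
        raise ValueError("estimated_hours must be positive")
    return sent_at + timedelta(hours=estimated_hours * factor)


# Assumed send time, used only for illustration.
sent = datetime(2012, 11, 27, 14, 0, 0)

# A 3-hour job gets 24 hours; an 800-hour job gets 6400 hours (about 267 days).
print(compute_deadline(sent, 3).isoformat())    # 2012-11-28T14:00:00
print(compute_deadline(sent, 800).isoformat())  # 2013-08-21T06:00:00
```

With a fixed 3-day deadline, the same 800-hour job would have been due long before it could finish, which is the mismatch described as bug 2.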


About Bug 3:
We have fixed a few bugs in our application and also updated the BOINC wrapper. Based on the results of our test jobs, there should no longer be any zombie lammps processes (processes that are not killed by a BOINC suspend). However, because Windows lacks an appropriate API for suspending a particular process, it takes a while, usually 5-20 seconds depending on the number of processes running on your host, for the lammps processes to disappear from the task manager.
____________
Go CAS@home! We help scientists to race against time!

Siegfried
Joined: 9 Jan 11
Posts: 1
Credit: 29,217
RAC: 0
Message 820 - Posted: 26 Nov 2012, 10:00:50 UTC - in response to Message 818.

Something is wrong with this process. It is a CPU-eating monster~ so I had no choice but to disable CAS...
____________

syLaaar
Joined: 8 Oct 12
Posts: 1
Credit: 17,814
RAC: 0
Message 821 - Posted: 26 Nov 2012, 12:28:09 UTC - in response to Message 818.

When I saw that it was due before it could finish, I gave up and stopped the computation.... and it was already more than half done..

AQY
Joined: 25 Jun 10
Posts: 5
Credit: 255,042
RAC: 0
Message 823 - Posted: 27 Nov 2012, 1:36:10 UTC - in response to Message 818.

The deadline is really far too early. You should learn from SETI.

Issac
Joined: 1 Mar 11
Posts: 7
Credit: 179,705
RAC: 0
Message 829 - Posted: 27 Nov 2012, 14:20:18 UTC

I just received a task with a deadline of 3rd July 2013!

27/11/2012 21:53:34 | CAS@home | Finished download of sb_12659_add81dcc0fe3be4d06a71b222d0a1bd4

Don't know if it is a problem,
and don't know if the above job number will help the administrator to trace the problem, if any.

Thx

Administrator
Joined: 9 Jul 12
Posts: 2
Credit: 12,588
RAC: 0
Message 830 - Posted: 28 Nov 2012, 1:09:59 UTC - in response to Message 818.

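Regarding 3, I have seen a similar problem too: BOINC had already exited, but some processes were still running in Task Manager and using a lot of CPU...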

wenjing wu
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Joined: 13 Sep 10
Posts: 161
Credit: 751,216
RAC: 0
Message 838 - Posted: 29 Nov 2012, 14:43:32 UTC - in response to Message 829.

Yes, we set the deadline very generously, at 8 times the real CPU time each job requires, and some of the Lammps jobs can last up to 800 hours. That is why you see a deadline in July next year.

I just received a task with a deadline of 3rd July 2013!

27/11/2012 21:53:34 | CAS@home | Finished download of sb_12659_add81dcc0fe3be4d06a71b222d0a1bd4

Don't know if it is a problem,
and don't know if the above job number will help the administrator to trace the problem, if any.

Thx


____________
Go CAS@home! We help scientists to race against time!

wenjing wu
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Avatar
Send message
Joined: 13 Sep 10
Posts: 161
Credit: 751,216
RAC: 0
Message 839 - Posted: 29 Nov 2012, 14:44:49 UTC - in response to Message 821.

We have reset the job deadlines: each job's deadline is now 8 times its actual running time.

When I saw that it was due before it could finish, I gave up and stopped the computation.... and it was already more than half done..


____________
Go CAS@home! We help scientists to race against time!

Tommy
Joined: 21 Nov 12
Posts: 15
Credit: 442,828
RAC: 0
Message 856 - Posted: 30 Nov 2012, 2:33:20 UTC - in response to Message 838.

However, I finished a task with a July 2013 deadline in just 24 hours....
batch_972_1970

Though I have to admit I rather like a deadline as late as July 2013.

等待黎明
Joined: 20 Nov 11
Posts: 2
Credit: 35,465
RAC: 0
Message 881 - Posted: 2 Dec 2012, 6:01:44 UTC

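Tsinghua University's nano project requires far too much computation! A workload of more than 5 million GFLOPs, and I am expected to return it within three days; even computing 24 hours a day would not be enough! Besides, not everyone should have to keep their machine on 24 hours a day for your project, which is hardly environmentally friendly. After all, BOINC is meant to use a computer's spare resources. Can't you make the work units smaller? Even the climateprediction.net project has only about 710,000 GFLOPs per task, and it allows a full year. Do you think we are all running supercomputers? On top of that, this project runs as high priority and grabs resources from other projects, just like malware, and the credit is so low. Worst of all, after I had painstakingly spent 210 hours and reached 50%, the task simply disappeared. What is that? Automatic cancellation?!
I wanted to contribute something to a Chinese science project, and it turns out to be this hard! It does not feel worth it. Eager for quick results, careless, self-righteous. Perhaps you are not putting any real care into the project, or into the volunteers' contributions! Go and learn from your colleagues abroad!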

炕苕
Joined: 8 Nov 12
Posts: 1
Credit: 4,847
RAC: 0
Message 908 - Posted: 6 Dec 2012, 10:49:37 UTC

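I have been doing distributed computing for nearly 10 years, and this is the worst project I have ever run. As a graduate of the Chinese Academy of Sciences, I am saddened that you wrote such poor distributed computing software. Is this the ability of CAS students today? First came deadlines so short that the tasks could not possibly be completed. I dare say you put these tasks online without ever running a complete test, otherwise such an elementary mistake could not have happened! Is this how you do research, and how you fob off your supervisors? If you publish a paper based on something like this and others run the simulation, cannot reproduce the experiment, and the data is suspected of being fabricated, you and your supervisors will be ruined for life! Then came tasks of 800 hours, with the estimated completion time not even stated properly: it says something like 3 hours and then takes 800. What is the point of the estimate then? The tasks are all overbearingly high priority; once your project is running, other projects are basically pushed aside. You do not even follow the most basic rules of the game. Have you gotten too used to queue-jumping at home? I recommended this project to several friends abroad who do distributed computing, and I now deeply regret it; the embarrassment has spread overseas! If the project's PM lacks the time or ability to manage it, then for the sake of the Chinese Academy of Sciences' reputation, please step down. If this is the ability of CAS students today, all I can say is that it is pitiful!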

Ubdaddy
Joined: 6 Nov 10
Posts: 3
Credit: 12,904
RAC: 0
Message 911 - Posted: 9 Dec 2012, 12:21:10 UTC - in response to Message 838.
Last modified: 9 Dec 2012, 12:22:06 UTC

Dear wenjing wu,

Setting the deadline in the distant future (8/28/2013 for my WUs) is fine if one assumes that CAS is the only BOINC client running on the participant's computer.

Where that is not the case (mine, for example), the BOINC manager simply ignores these WUs and runs the other clients.

I had to suspend all the other clients' WUs in order for Nano Tech to run, but I do not want to make a habit of it.

Best regards,

Ubdaddy

Arthur200000
Joined: 9 Dec 12
Posts: 2
Credit: 874
RAC: 0
Message 913 - Posted: 10 Dec 2012, 10:19:12 UTC - in response to Message 829.
Last modified: 10 Dec 2012, 10:20:04 UTC

Bug 2 is fixed now: the deadline is set to 8 times the estimated running time of each job. Some of the Lammps jobs can last up to 800 hours, so if you are lucky enough to get one of these jobs on your host, you will see a deadline in July 2013. Do not panic about it! We have implemented checkpointing in the Lammps jobs, so even if you restart your computer you will not lose what has been finished, and the job will continue from where it stopped.

Here is the update.

Shanghai Experimental School Class1 Grade3

China

Ubdaddy
Joined: 6 Nov 10
Posts: 3
Credit: 12,904
RAC: 0
Message 916 - Posted: 10 Dec 2012, 16:22:05 UTC - in response to Message 913.

I know how BOINC works, checkpointing and all, and that some projects have long WUs; I used to run CPDN.

You didn't address the issue I raised.

On my machine BOINC runs other projects as well, all of them with deadlines within a few weeks.

Which do you think the BOINC manager will prefer to run: one with a deadline next summer, or another with a shorter deadline?

Regards

ritterm
Joined: 21 Jul 10
Posts: 17
Credit: 157,228
RAC: 0
Message 918 - Posted: 11 Dec 2012, 14:09:57 UTC - in response to Message 916.
Last modified: 11 Dec 2012, 14:14:27 UTC

Which do you think the BOINC manager will prefer to run: one with a deadline next summer, or another with a shorter deadline?

What about the resource share setting? If your setting for CAS is the same as your other projects, won't that even out the amount of work done even if the CAS deadline is far in the future?

On all of my hosts, I have these distant-deadline CAS tasks running just fine alongside other tasks with deadlines ranging from days to weeks in the future.
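
The resource share point can be pictured with a toy model. The BOINC client is usually described as dividing CPU time between projects according to resource share, and as switching to earliest-deadline-first only for tasks that are in danger of missing their deadline. The Python sketch below works under that assumption and is not the real client scheduler; `Task`, `pick_next`, the project name `OtherProject`, and all the numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    project: str
    remaining_hours: float    # estimated CPU time still needed
    hours_to_deadline: float  # wall-clock time left before the deadline


def pick_next(tasks, resource_share, hours_run):
    """Pick the next task to run in this toy model (not BOINC's real code).

    1. If any task cannot finish before its deadline even at full speed,
       run the at-risk task with the earliest deadline.
    2. Otherwise run a task from the project that has received the least
       CPU time relative to its resource share.
    """
    at_risk = [t for t in tasks if t.remaining_hours > t.hours_to_deadline]
    if at_risk:
        return min(at_risk, key=lambda t: t.hours_to_deadline)
    return min(
        tasks,
        key=lambda t: hours_run.get(t.project, 0.0) / resource_share[t.project],
    )


tasks = [
    Task("lammps_long", "CAS@home", remaining_hours=800, hours_to_deadline=6400),
    Task("other_short", "OtherProject", remaining_hours=10, hours_to_deadline=336),
]
share = {"CAS@home": 100, "OtherProject": 100}   # equal resource share
hours_run = {"CAS@home": 0.0, "OtherProject": 5.0}

# Neither task is at risk, so the project that is behind on CPU time runs.
print(pick_next(tasks, share, hours_run).name)  # lammps_long
```

In this model a far-off deadline does not starve the long task: it only means the task never needs to jump the queue.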
____________

ritterm
Joined: 21 Jul 10
Posts: 17
Credit: 157,228
RAC: 0
Message 919 - Posted: 12 Dec 2012, 3:45:08 UTC

Compute errors on these Nano Tech tasks:

20059991
20052196
20043050
20042137

Similar stderr output on all:

close failed: [Errno 9] Bad file descriptor
20:23:00 (4484): wrapper: running parse_result.exe ()
Traceback (most recent call last):
File "lammps_parse_result_1.00_windows_intelx86.py", line 19, in <module>
IOError: [Errno 9] Bad file descriptor
close failed: [Errno 2] No such file or directory
app exit status: 0xff
20:23:05 (4484): called boinc_finish



8 days of crunching down the drain... :-(
____________

wenjing wu
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Joined: 13 Sep 10
Posts: 161
Credit: 751,216
RAC: 0
Message 939 - Posted: 24 Dec 2012, 15:17:19 UTC - in response to Message 919.

I think there was a bug in one of the applications that extracts the results. That is why some jobs failed after hours of computing. It is a real pity. I have a suspicion about where the bug is and have made the changes. As I cannot reproduce the error, I am not 100% sure that is where the problem is. Anyway, a new version, Lammps 1.27, has been released with these changes. Let us see how it works over the next couple of days.
____________
Go CAS@home! We help scientists to race against time!

ritterm
Joined: 21 Jul 10
Posts: 17
Credit: 157,228
RAC: 0
Message 940 - Posted: 24 Dec 2012, 19:45:29 UTC

For any others concerned about very long running tasks, Wenjing provided this feedback in a reply to a PM:

Wenjing Wu wrote:

I have talked to the scientists, and they confirmed that there is nothing wrong in the job itself that would cause the termination, so I contacted the BOINC developer to see if anything might cause this from the client side. I understand your concern about it, and this is definitely something we need to solve to avoid wasting crunchers' CPU time!!


This situation may be overcome by the new application she mentioned previously, but it might be worth letting any 1.25 apps run to completion. I'm letting this task run and hope that it finishes successfully. At the time of this post, it has just over 6 days of runtime, but is only ~14% complete.
____________

ritterm
Joined: 21 Jul 10
Posts: 17
Credit: 157,228
RAC: 0
Message 941 - Posted: 28 Dec 2012, 11:22:38 UTC

Task 20084369 failed... 6 more days of lost crunching. I'm suspending app 1.25 tasks while I consider aborting them... I don't want to waste any more time. :-(
____________

Richard
Joined: 4 Dec 10
Posts: 3
Credit: 31,886
RAC: 0
Message 942 - Posted: 29 Dec 2012, 3:14:38 UTC

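Lately CAS has had one bug after another.
A single task ran for over a week, 80 hours of CPU time, and ended in a computation error. If things carry on like this, who will still want to take part in this project?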
