The only interesting part from dmesg is:
[ 4.269262] [Hardware Error]: MC4 Error (node 0): Watchdog timeout due to lack of progress.
[ 4.271749] [Hardware Error]: Error Status: System Fatal error.
[ 4.272970] [Hardware Error]: CPU:0 (16:0:1) MC4_STATUS[Over|UE|-|PCC|AddrV|-|-]: 0xf600000000070f0f
[ 4.275455] [Hardware Error]: MC4_ADDR: 0x00000000fe80c000
[ 4.276651] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)
This may point to a hardware issue. The message appears in dmesg only after an unclean reboot; after a clean, user-initiated reboot it does not occur.
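For anyone who wants to double-check the flags printed in the MC4_STATUS line, here is a minimal Python sketch that decodes the top bits of the raw value. It assumes the standard x86 machine-check status layout (Val = bit 63, Over = 62, UC = 61, En = 60, MiscV = 59, AddrV = 58, PCC = 57, error code in bits 15:0). The function name `decode_mci_status` is made up for this illustration, and the sketch does not interpret the AMD-specific extended error code.

```python
# Flag positions in an x86 MCi_STATUS register (assumed standard layout).
# dmesg prints the UC (uncorrected) bit as "UE".
FLAGS = [
    (63, "Val"),
    (62, "Over"),
    (61, "UC"),
    (60, "En"),
    (59, "MiscV"),
    (58, "AddrV"),
    (57, "PCC"),
]


def decode_mci_status(status):
    """Return the flag bits and the low 16-bit error code of a status value."""
    decoded = {name: bool((status >> bit) & 1) for bit, name in FLAGS}
    decoded["error_code"] = status & 0xFFFF  # bits 15:0
    return decoded


# Raw value copied from the dmesg output above.
decoded = decode_mci_status(0xF600000000070F0F)

for name, value in decoded.items():
    if name == "error_code":
        print(f"{name}: {value:#06x}")
    else:
        print(f"{name}: {value}")
```

For the value above, Over, UC, PCC and AddrV are set and MiscV is clear, which agrees with the `[Over|UE|-|PCC|AddrV|-|-]` string the kernel printed.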
We also have a MOI Pro-AMD with 4x TBS6991SE PCIe cards, and in that configuration there is no such issue. We removed the 4x TBS6910 cards and replaced them with 4x TBS6991SE cards (using the same CAM modules taken from the TBS6910 cards). The problem has not occurred since last week. We also put the removed TBS6910 cards into a standard workstation, and they work fine there.
So it looks like the problem occurs only when the TBS6910 cards are used in the MOI Pro-AMD (which we got from your DE supplier as a complete setup).
So far we tried:
- updating to stock CentOS 7
- the newest TBS drivers
- the newest open-source drivers
- adding an external power supply for the TBS6910 cards
Nothing helped.
We also have an interesting case with a single MOI Pro-AMD with 2x TBS6205 cards. In this configuration a reboot happens after every "sensors" command. It is not random, as it is in the setup described above. We also tried the "sensors" command without the TBS drivers loaded:
[root@streamer ~]# rmmod tbs_pcie_dvb
[root@streamer ~]# rmmod tbs6205fe
[root@streamer ~]# uname -a
Linux streamer.local 3.10.0-229.el7.x86_64 #1 SMP Fri Mar 6 11:36:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@streamer ~]# sensors
radeon-pci-0008
Adapter: PCI adapter
... (reboot)
After the reboot, exactly the same [Hardware Error] message appears in dmesg (as quoted above). Could this problem be related to the new cards with the FPGA PCIe bridge?