Failed to start package crsp_s1, rollback steps

Symptoms

Node# tail /var/adm/cmcluster/log/crsp_s1.log
Sep 25 01:16:50 – Node “” *** /opt/cmcluster/SGeRAC/toolkit/crsp/toolkit_oc.sh called with start argument. ***
Sep 25 01:16:50 – Node “” : Starting Oracle Clusterware at Tue Sep 25 01:16:50 UTC 2018
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
Sep 25 01:16:50 – Node “” ERROR: Function oc_start_cmd: Failed to start Oracle Clusterware
Sep 25 01:16:50 [email protected] master_control_script.sh[5486]: ##### Failed to start package crsp_s1, rollback steps #####
Sep 25 01:16:50 – Node “” *** /opt/cmcluster/SGeRAC/toolkit/crsp/toolkit_oc.sh called with stop argument. ***
Sep 25 01:16:50 – Node “” : Stopping Oracle Clusterware at Tue Sep 25 01:16:50 UTC 2018
Sep 25 01:16:50 – Node “” Oracle Clusterware is already stopped
Sep 25 01:16:50 [email protected] master_control_script.sh[5486]: ###### Failed to start package for crsp_s1 ######

Node:home/ # cmviewcl

CLUSTER STATUS
<clustername> up

SITE_NAME Node_pri

NODE STATUS STATE
Node1 up running
Node2 up running

PACKAGE STATUS STATE AUTO_RUN NODE
prismp_sc up running enabled Node2

NODE STATUS STATE
Node3 up running

SITE_NAME Node_sec

NODE STATUS STATE
Node4 up running
Node5 up running
Node6 up running

MULTI_NODE_PACKAGES

PACKAGE STATUS STATE AUTO_RUN SYSTEM
SG-CFS-pkg up running enabled yes
SG-CFS-crsp_s1 up running enabled no
SG-CFS-crsp_s2 up running enabled no
crsp_s1 up (2/3) running enabled no
crsp_s2 up running enabled no
SG-CFS-prismp_s1 up running enabled no
SG-CFS-prismp_s2 down halted enabled no
prismp_s1 up (2/3) running enabled no
prismp_s2 down halted enabled no
Node:home/ #
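
The cmviewcl listing above can be scanned by a small filter rather than by eye. The following is only a sketch: the function name degraded_packages is hypothetical, and it assumes the whitespace-separated column layout shown above (package name, STATUS, then either STATE or an "(n/m)" node count).

```shell
# Read cmviewcl-style text on stdin and print every package that is
# down, or up on only some of its nodes (STATUS shown as "up (n/m)").
# Assumption: package rows follow a header line starting with "PACKAGE".
degraded_packages() {
    awk '
        $1 == "PACKAGE" { in_pkgs = 1; next }   # header row of a package table
        NF == 0 || $1 == "NODE" || $1 == "SITE_NAME" { in_pkgs = 0; next }
        in_pkgs && $2 == "down"             { print $1 ": down"; next }
        in_pkgs && $2 == "up" && $3 ~ /^\(/ { print $1 ": partial " $3 }
    '
}

# Example use on a cluster node:
#   cmviewcl | degraded_packages
```

Run against the listing above, this would flag crsp_s1 and prismp_s1 as partial (2/3) and SG-CFS-prismp_s2 and prismp_s2 as down.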

Causes

The ocssd.log entries below point to a network connectivity problem: Node1 still sees disk heartbeats (HB) from Node2 and Node3, but no network heartbeats:

Node1:/ $ tail /u01/app/grid/11203/log/Node1/cssd/ocssd.log
2018-09-21 10:47:08.187: [ CSSD][27]clssnmvDHBValidateNcopy: node 2, Node2, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 225000299, LATS 275224262, lastSeqNo 225000296, uniqueness 1519012441, timestamp 1537526827/1334746757
2018-09-21 10:47:08.187: [ CSSD][27]clssnmvDHBValidateNcopy: node 3, Node3, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 224639603, LATS 275224262, lastSeqNo 224639600, uniqueness 1519018579, timestamp 1537526827/1328775359
2018-09-21 10:47:08.190: [ CSSD][30]clssnmvDHBValidateNcopy: node 3, Node3, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 224639604, LATS 275224264, lastSeqNo 224639601, uniqueness 1519018579, timestamp 1537526827/1328775836
2018-09-21 10:47:08.197: [ CSSD][36]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2018-09-21 10:47:08.200: [ CSSD][33]clssnmvDHBValidateNcopy: node 3, Node3, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 224639605, LATS 275224274, lastSeqNo 224639602, uniqueness 1519018579, timestamp 1537526828/1328775956
2018-09-21 10:47:09.196: [ CSSD][30]clssnmvDHBValidateNcopy: node 2, Node2, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 225000300, LATS 275225270, lastSeqNo 225000021, uniqueness 1519012441, timestamp 1537526828/1334747680
2018-09-21 10:47:09.196: [ CSSD][30]clssnmvDHBValidateNcopy: node 3, Node3, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 224639607, LATS 275225270, lastSeqNo 224639604, uniqueness 1519018579, timestamp 1537526828/1328776846
2018-09-21 10:47:09.197: [ CSSD][27]clssnmvDHBValidateNcopy: node 2, Node2, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 225000302, LATS 275225272, lastSeqNo 225000299, uniqueness 1519012441, timestamp 1537526828/1334747769
2018-09-21 10:47:09.207: [ CSSD][36]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2018-09-21 10:47:09.210: [ CSSD][33]clssnmvDHBValidateNcopy: node 3, Node3, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 224639608, LATS 275225284, lastSeqNo 224639605, uniqueness 1519018579, timestamp 1537526829/1328776966
Node1:/ $
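
To see at a glance which peers are affected, the "disk HB, but no network HB" messages can be counted per node. This is a rough sketch, not an Oracle-provided tool: the function name missing_network_hb is hypothetical, and it assumes the message layout shown above, where the node name follows "node <number>,".

```shell
# Read ocssd.log text on stdin and print "<node name> <count>" for every
# node reported as having a disk heartbeat but no network heartbeat.
missing_network_hb() {
    awk '
        /has a disk HB, but no network HB/ {
            # Find the word "node"; the name is two fields later ("node 2, Node2,").
            for (i = 1; i < NF; i++) {
                if ($i == "node") {
                    name = $(i + 2)
                    sub(/,$/, "", name)   # drop the trailing comma
                    count[name]++
                    break
                }
            }
        }
        END { for (n in count) print n, count[n] }
    ' | sort
}

# Example use (log path taken from the session above):
#   tail /u01/app/grid/11203/log/Node1/cssd/ocssd.log | missing_network_hb
```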

An attempt to ping the CI gateway failed (no replies were received):

Node1:11203/bin # ping CI-GW
PING CI-GW: 64 byte packets

Resolutions

The CI network is currently configured on lan1, which is not working, so it needs to be moved to another working LAN interface whose link state is UP.

After switching to a working LAN interface, the ping succeeds:

Node1:11203/bin # ping CI-GW
PING CI-GW: 64 byte packets
64 bytes from CI-GW: icmp_seq=0. time=0. ms
64 bytes from CI-GW: icmp_seq=1. time=0. ms
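
The difference between the failed and the successful ping above (reply lines present or absent) can also be tested in a script. The sketch below is an assumption-laden helper, not part of Serviceguard: the name ping_has_reply is hypothetical, and it only recognises reply lines in the "64 bytes from ...: icmp_seq=..." format shown above.

```shell
# Exit status 0 if the ping output on stdin contains at least one
# reply line, non-zero otherwise.
ping_has_reply() {
    grep -q 'bytes from .*icmp_seq='
}

# Example use. The way to limit the packet count differs between
# platforms, so check ping(1M) on your system before relying on this:
#   ping CI-GW -n 3 | ping_has_reply && echo "CI gateway reachable"
```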


The crsp toolkit can then be started:

Node1:11203/bin # /opt/cmcluster/SGeRAC/toolkit/crsp/toolkit_oc.sh start
Sep 25 02:46:46 – Node “Node1” *** /opt/cmcluster/SGeRAC/toolkit/crsp/toolkit_oc.sh called with start argument. ***
Sep 25 02:46:46 – Node “Node1” : Starting Oracle Clusterware at Tue Sep 25 02:46:46 UTC 2018
Sep 25 02:46:46 – Node “Node1” Oracle Clusterware is already started
Node1:11203/bin #

After that, switching of the crsp package needs to be enabled on the node:

Node:11203/bin # cmmodpkg -e -v -n Node1 crsp_s1
Enabling node Node1 for switching of package crsp_s1
Successfully enabled package crsp_s1 to run on node Node1
cmmodpkg: Completed successfully on all packages specified
Node1:11203/bin # cmrunpkg crsp_s1
Package crsp_s1 is already running on all active nodes
cmrunpkg: All specified packages are running
Node1:11203/bin #

Verify the running packages with the cmviewcl command:

Node1:11203/bin # cmviewcl

CLUSTER STATUS
<clustername> up

SITE_NAME Site_pri

NODE STATUS STATE
Node1 up running
Node2 up running

PACKAGE STATUS STATE AUTO_RUN NODE
prismp_sc up running enabled Node3

NODE STATUS STATE
Node3 up running

SITE_NAME Site_sec

NODE STATUS STATE
Node4 up running
Node5 up running
Node6 up running

MULTI_NODE_PACKAGES

PACKAGE STATUS STATE AUTO_RUN SYSTEM
SG-CFS-pkg up running enabled yes
SG-CFS-crsp_s1 up running enabled no
SG-CFS-crsp_s2 up running enabled no
crsp_s1 up running enabled no
crsp_s2 up running enabled no

#################################################