Failed to start package crsp_s1, rollback steps

Symptoms

Node# tail /var/adm/cmcluster/log/crsp_s1.log
Sep 25 01:16:50 – Node “” *** /opt/cmcluster/SGeRAC/toolkit/crsp/toolkit_oc.sh called with start argument. ***
Sep 25 01:16:50 – Node “” : Starting Oracle Clusterware at Tue Sep 25 01:16:50 UTC 2018
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
Sep 25 01:16:50 – Node “” ERROR: Function oc_start_cmd: Failed to start Oracle Clusterware
Sep 25 01:16:50 [email protected] master_control_script.sh[5486]: ##### Failed to start package crsp_s1, rollback steps #####
Sep 25 01:16:50 – Node “” *** /opt/cmcluster/SGeRAC/toolkit/crsp/toolkit_oc.sh called with stop argument. ***
Sep 25 01:16:50 – Node “” : Stopping Oracle Clusterware at Tue Sep 25 01:16:50 UTC 2018
Sep 25 01:16:50 – Node “” Oracle Clusterware is already stopped
Sep 25 01:16:50 [email protected] master_control_script.sh[5486]: ###### Failed to start package for crsp_s1 ######

Node:home/ # cmviewcl

CLUSTER STATUS
<clustername> up

SITE_NAME Node_pri

NODE STATUS STATE
Node1 up running
Node2 up running

PACKAGE STATUS STATE AUTO_RUN NODE
prismp_sc up running enabled Node2

NODE STATUS STATE
Node3 up running

SITE_NAME Node_sec

NODE STATUS STATE
Node4 up running
Node5 up running
Node6 up running

MULTI_NODE_PACKAGES

PACKAGE STATUS STATE AUTO_RUN SYSTEM
SG-CFS-pkg up running enabled yes
SG-CFS-crsp_s1 up running enabled no
SG-CFS-crsp_s2 up running enabled no
crsp_s1 up (2/3) running enabled no
crsp_s2 up running enabled no
SG-CFS-prismp_s1 up running enabled no
SG-CFS-prismp_s2 down halted enabled no
prismp_s1 up (2/3) running enabled no
prismp_s2 down halted enabled no
Node:home/ #
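In a long cmviewcl listing it is easy to miss a package that is only partly up, such as crsp_s1 at (2/3) above. The following is a minimal sketch that filters such packages out of the listing. It assumes the column order shown above (PACKAGE STATUS STATE ...); the function name list_degraded_packages is hypothetical and not part of Serviceguard.

```shell
# Read cmviewcl output on stdin and print every package that is not fully up.
# Assumption: package rows follow a header line starting with "PACKAGE" and
# use the column order PACKAGE STATUS STATE ... as in the listing above.
list_degraded_packages() {
    awk '
        # A "PACKAGE" header line opens a package table.
        $1 == "PACKAGE" { in_pkgs = 1; next }
        # A blank line or any other section header closes it.
        NF == 0 || $1 == "NODE" || $1 == "SITE_NAME" || $1 == "MULTI_NODE_PACKAGES" { in_pkgs = 0; next }
        # Status other than "up", for example "down".
        in_pkgs && $2 != "up" { print $1, $2; next }
        # Status "up" followed by an instance count, for example "up (2/3)".
        in_pkgs && $3 ~ /^\(/ { print $1, $2, $3 }
    '
}
```

It could be used as: cmviewcl | list_degraded_packages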

Causes

The log below points to a network connection issue: Node1 still sees disk heartbeats from Node2 and Node3, but no network heartbeats:

Node1:/ $ tail /u01/app/grid/11203/log/Node1/cssd/ocssd.log
2018-09-21 10:47:08.187: [ CSSD][27]clssnmvDHBValidateNcopy: node 2, Node2, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 225000299, LATS 275224262, lastSeqNo 225000296, uniqueness 1519012441, timestamp 1537526827/1334746757
2018-09-21 10:47:08.187: [ CSSD][27]clssnmvDHBValidateNcopy: node 3, Node3, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 224639603, LATS 275224262, lastSeqNo 224639600, uniqueness 1519018579, timestamp 1537526827/1328775359
2018-09-21 10:47:08.190: [ CSSD][30]clssnmvDHBValidateNcopy: node 3, Node3, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 224639604, LATS 275224264, lastSeqNo 224639601, uniqueness 1519018579, timestamp 1537526827/1328775836
2018-09-21 10:47:08.197: [ CSSD][36]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2018-09-21 10:47:08.200: [ CSSD][33]clssnmvDHBValidateNcopy: node 3, Node3, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 224639605, LATS 275224274, lastSeqNo 224639602, uniqueness 1519018579, timestamp 1537526828/1328775956
2018-09-21 10:47:09.196: [ CSSD][30]clssnmvDHBValidateNcopy: node 2, Node2, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 225000300, LATS 275225270, lastSeqNo 225000021, uniqueness 1519012441, timestamp 1537526828/1334747680
2018-09-21 10:47:09.196: [ CSSD][30]clssnmvDHBValidateNcopy: node 3, Node3, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 224639607, LATS 275225270, lastSeqNo 224639604, uniqueness 1519018579, timestamp 1537526828/1328776846
2018-09-21 10:47:09.197: [ CSSD][27]clssnmvDHBValidateNcopy: node 2, Node2, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 225000302, LATS 275225272, lastSeqNo 225000299, uniqueness 1519012441, timestamp 1537526828/1334747769
2018-09-21 10:47:09.207: [ CSSD][36]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2018-09-21 10:47:09.210: [ CSSD][33]clssnmvDHBValidateNcopy: node 3, Node3, has a disk HB, but no network HB, DHB has rcfg 414478488, wrtcnt, 224639608, LATS 275225284, lastSeqNo 224639605, uniqueness 1519018579, timestamp 1537526829/1328776966
Node1:/ $
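To see at a glance which peers are affected, the "has a disk HB, but no network HB" messages can be counted per node. This is a rough sketch that assumes the ocssd.log line layout shown above; the function name count_missing_network_hb is made up for illustration.

```shell
# Read ocssd.log text on stdin and print "<node name> <count>" for every node
# reported as having a disk heartbeat but no network heartbeat.
# Assumption: lines look like "... node 2, Node2, has a disk HB, but no network HB, ..."
count_missing_network_hb() {
    awk '
        /has a disk HB, but no network HB/ {
            for (i = 1; i < NF; i++) {
                if ($i == "node") {
                    name = $(i + 2)        # field after "node <n>,"
                    sub(/,$/, "", name)    # drop the trailing comma
                    count[name]++
                    break
                }
            }
        }
        END { for (n in count) print n, count[n] }
    ' | sort
}
```

It could be used as: count_missing_network_hb < /u01/app/grid/11203/log/Node1/cssd/ocssd.log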

 

A ping to the CI gateway also failed:

Node1:11203/bin # ping CI-GW
PING CI-GW: 64 byte packets

 

Resolutions

The CI is currently configured on lan1, which is not working, so it needs to be moved to another LAN interface whose link state is UP.

After the CI was moved to a working LAN interface, the ping succeeded:

Node1:11203/bin # ping CI-GW
PING CI-GW: 64 byte packets
64 bytes from CI-GW: icmp_seq=0. time=0. ms
64 bytes from CI-GW: icmp_seq=1. time=0. ms
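If this check needs to go into a script, the ping output can be tested for reply lines instead of being read by eye. The sketch below only inspects text on stdin, so it does not depend on platform-specific ping options; the function name ping_got_reply is an assumption for illustration.

```shell
# Exit 0 if the ping output on stdin contains at least one reply line
# ("64 bytes from ..."), and non-zero otherwise.
ping_got_reply() {
    grep -q 'bytes from'
}
```

Pipe the output of a bounded ping (one that stops after a few packets) into ping_got_reply and test the exit status.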


The crsp toolkit can then be started:

Node1:11203/bin # /opt/cmcluster/SGeRAC/toolkit/crsp/toolkit_oc.sh start
Sep 25 02:46:46 – Node “Node1” *** /opt/cmcluster/SGeRAC/toolkit/crsp/toolkit_oc.sh called with start argument. ***
Sep 25 02:46:46 – Node “Node1” : Starting Oracle Clusterware at Tue Sep 25 02:46:46 UTC 2018
Sep 25 02:46:46 – Node “Node1” Oracle Clusterware is already started
Node1:11203/bin #

After that, switching needs to be enabled for the crsp package on the node:

Node:11203/bin # cmmodpkg -e -v -n Node1 crsp_s1
Enabling node Node1 for switching of package crsp_s1
Successfully enabled package crsp_s1 to run on node Node1
cmmodpkg: Completed successfully on all packages specified
Node1:11203/bin # cmrunpkg crsp_s1
Package crsp_s1 is already running on all active nodes
cmrunpkg: All specified packages are running
Node1:11203/bin #
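The two commands above can be wrapped in a small helper so that they are always run in the same order. This is only a sketch: the function name enable_and_run_pkg and the DRY_RUN variable are invented here, and the helper prints the commands by default rather than running them.

```shell
# Print the two Serviceguard commands used above, or run them when DRY_RUN=0.
# DRY_RUN and the function name are illustrative and not part of Serviceguard.
enable_and_run_pkg() {
    node=$1
    pkg=$2
    if [ "${DRY_RUN:-1}" = 1 ]; then
        run=echo    # dry run: only show what would be executed
    else
        run=        # real run: execute the commands
    fi
    # Enable the node for switching of the package, then start the package.
    $run cmmodpkg -e -v -n "$node" "$pkg" || return 1
    $run cmrunpkg "$pkg"
}
```

Calling enable_and_run_pkg Node1 crsp_s1 prints the commands; prefix the call with DRY_RUN=0 to execute them.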

Verify the running packages with the cmviewcl command:

Node1:11203/bin # cmviewcl

CLUSTER STATUS
<clustername> up

SITE_NAME Site_pri

NODE STATUS STATE
Node1 up running
Node2 up running

PACKAGE STATUS STATE AUTO_RUN NODE
prismp_sc up running enabled Node3

NODE STATUS STATE
Node3 up running

SITE_NAME Site_sec

NODE STATUS STATE
Node4 up running
Node5 up running
Node6 up running

MULTI_NODE_PACKAGES

PACKAGE STATUS STATE AUTO_RUN SYSTEM
SG-CFS-pkg up running enabled yes
SG-CFS-crsp_s1 up running enabled no
SG-CFS-crsp_s2 up running enabled no
crsp_s1 up running enabled no
crsp_s2 up running enabled no

#################################################

Unable to run package on node

Symptoms

When you try to bring up the package in Serviceguard, it fails to start with the errors below:

[[email protected] ~]# cmrunpkg <packagename>
Running package <packagename> on node node2
The package script for <packagename> failed with no restart. <packagename> should not be restarted
Unable to run package <packagename> on node node2
Check the syslog and pkg log files for more detailed information
cmrunpkg: Unable to start some package or package instances.

The same happens when we try to bring up the package on the other node.

Cause

The package log file at /usr/local/cmcluster/run/log/<packagename>.log shows the following errors:

Sep 20 00:09:03 – Node “node2”: Exporting filesystem on /opt/apps
exportfs: internal: no supported addresses in nfs_client
exportfs: <ip_address>:/opt/apps: No such file or directory

exportfs: internal: no supported addresses in nfs_client
exportfs: <ip_address>:/opt/apps: No such file or directory

exportfs: internal: no supported addresses in nfs_client
exportfs: <ip_address>:/opt/apps: No such file or directory

exportfs: internal: no supported addresses in nfs_client
exportfs: <ip_address>:/opt/apps: No such file or directory

exportfs: internal: no supported addresses in nfs_client
exportfs: <ip_address>:/opt/apps: No such file or directory
ERROR: Function export_fs
ERROR: Failed to export -o rw @nfs1:/opt/apps
Sep 20 00:09:04 – Node “node2”: Unexporting filesystem on @nfs1:/opt/apps

## Failed to start package <packagename>, rollback steps #####
Sep 19 23:44:20 [email protected] tkit_module.sh[32107]: Install directory operation mode selected.
WARNING: Stoping rmtab synchronization proces: /usr/local/cmcluster/conf/<packagename>/sync_rmtab.PID does not exist
Sep 19 23:44:20 – Node “node2”: Unexporting filesystem on @nfs1:/opt/apps
exportfs: Could not find ‘@nfs1:/opt/apps’ to unexport.
ERROR: Function un_export_fs
ERROR: Failed to unexport @nfs1:/opt/apps

Sep 20 00:09:05 [email protected] master_control_script.sh[31933]: ###### Failed to start package for <packagename> ######
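The repeated exportfs retries make the log noisy, so it can help to pull out just the ERROR lines, which name the failing function (export_fs here). A minimal sketch, assuming errors start with "ERROR:" at the beginning of the line as in the log above; the function name package_log_errors is hypothetical.

```shell
# Read a package log on stdin and print each distinct "ERROR:" line once,
# in the order in which it first appears.
package_log_errors() {
    awk '/^ERROR:/ && !seen[$0]++'
}
```

It could be used as: package_log_errors < /usr/local/cmcluster/run/log/<packagename>.log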

Check the status of the NFS services:

[[email protected] ]# /etc/init.d/nfs status
rpc.svcgssd is stopped
rpc.mountd is stopped
nfsd is stopped
rpc.rquotad is stopped
[[email protected]]#

The package will not start because the NFS services are stopped; they need to be running before the package can be brought up.

 

Resolutions

Start the NFS services:

[[email protected]]# /etc/init.d/nfs start
Starting NFS services: [ OK ]
Starting NFS quotas: [ OK ]
Starting NFS mountd: rpc.mountd: svc_tli_create: could not open connection for udp6
rpc.mountd: svc_tli_create: could not open connection for tcp6
rpc.mountd: svc_tli_create: could not open connection for udp6
rpc.mountd: svc_tli_create: could not open connection for tcp6
rpc.mountd: svc_tli_create: could not open connection for udp6
rpc.mountd: svc_tli_create: could not open connection for tcp6
[ OK ]
Starting NFS daemon: rpc.nfsd: address family inet6 not supported by protocol TCP
[ OK ]
Starting RPC idmapd: [ OK ]

Verify the NFS services:
[[email protected]]# /etc/init.d/nfs status
rpc.svcgssd is stopped
rpc.mountd (pid 17790) is running…
nfsd (pid 17810 17809 17808 17807 17806 17805 17804 17803) is running…
rpc.rquotad (pid 17773) is running…
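The status check can be scripted so that the package is only started once the daemons it needs are up. The sketch below reads the status text on stdin and checks the daemons named as arguments. Which daemons are actually required is an assumption: in the output above the package works with rpc.svcgssd still stopped, so only nfsd and rpc.mountd are checked in the usage example. The function name nfs_daemons_running is made up.

```shell
# Exit 0 only if every daemon named as an argument is reported as running
# in the "/etc/init.d/nfs status" text supplied on stdin.
nfs_daemons_running() {
    nfs_status_text=$(cat)
    for daemon in "$@"; do
        # A line such as "nfsd (pid 17810 ...) is running" counts as running.
        printf '%s\n' "$nfs_status_text" | grep -q "^$daemon .*is running" || return 1
    done
}
```

It could be used as: /etc/init.d/nfs status | nfs_daemons_running nfsd rpc.mountd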

The package can then be started:
[[email protected]]# cmrunpkg <packagename>
Running package <packagename> on node node2
Successfully started package <packagename> on node node2
cmrunpkg: All specified packages are running
[[email protected]]#

Lastly, verify the status of the packages in the cluster:

[[email protected] ~]# cmviewcl

CLUSTER STATUS
<clustername> up

SITE_NAME Site1_pri

NODE STATUS STATE
node1 up running

SITE_NAME Site2_sec

NODE STATUS STATE
node2 up running

PACKAGE STATUS STATE AUTO_RUN NODE
<packagename> up running disabled node2

##################################################

Unable to Change Directory to the Mount Point as Root – Permission Denied on HP-UX

This post shows how to resolve a “permission denied” error when changing to a specific directory, even as root. For example:

# cd /usr/local/sap/tools/
ksh: /usr/local/sap/tools/: permission denied
# ll /usr/local/sap/tools/
/usr/local/sap/tools/ not found
#

When I listed all the mount points, the one I wanted was not there; only /usr was listed. I believe the directory above does not live on /usr, though, and is instead mounted from the network. Changing the mode of the directory did not work either:

# pwd
/usr/local
# chmod 755 sap/
chmod: can’t change sap/: Permission denied

On a working server, the same mount point shows up as an NFS mount imported from an NFS server:

tools-x.xx.xxx.net:/usr/local/sap
4145152 2287273 1741912 57% /usr/local/sap

To confirm, I checked which mount points the NFS server exports:

# showmount -e <nfs_server>
export list for <nfs_server>:
/usr/local/sap (everyone)

The result shows that the mount point is exported to everyone, so mounting it from the client side should not be a problem.
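When a server exports many paths, the client list for one path can be picked out of the showmount output directly. A small sketch, assuming the "<path> <clients>" layout shown above; the function name export_clients is hypothetical.

```shell
# Read "showmount -e" output on stdin and print the client list reported
# for the exported path given as the first argument.
export_clients() {
    awk -v path="$1" '$1 == path { $1 = ""; sub(/^ +/, ""); print }'
}
```

It could be used as: showmount -e <nfs_server> | export_clients /usr/local/sap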

Cause

However, when I tried to mount the NFS export on the client side, it failed with a device busy error:

# mount <nfs_server>:/usr/local/sap /usr/local/sap
nfs mount: /usr/local/sap: Device busy

The mount table shows that the mount point is already mounted, from /etc/auto_direct:

# mount | grep -i 'local/sap'
/usr/local/sap on /etc/auto_direct ignore,direct,dev=4000044 on Fri Aug 31 15:49:08 2018
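To check from a script what currently occupies a mount point, the source column of the mount output can be extracted. This is a sketch that assumes the "<mountpoint> on <source> <options> on <date>" layout shown above; the function name mount_source is invented for illustration.

```shell
# Read mount output on stdin and print what is mounted on the mount point
# given as the first argument.
# Assumption: lines look like "<mountpoint> on <source> <options> on <date>".
mount_source() {
    awk -v mp="$1" '$1 == mp && $2 == "on" { print $3 }'
}
```

With the output above, mount | mount_source /usr/local/sap would print /etc/auto_direct rather than an NFS server path.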

Resolution

This can be resolved by unmounting the mount point first and then mounting it again. You can verify the mount point with the ‘bdf’ command, as in the example below:

# umount /usr/local/sap
# mount tools-<nfs_server>:/usr/local/sap /usr/local/sap
# bdf

tools-ent.<nfs_server>:/usr/local/sap
4145152 2287273 1741912 57% /usr/local/sap
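As the output above shows, bdf puts a long filesystem name on its own line, which makes a plain grep on the mount point lose the name. The sketch below joins such wrapped lines and prints the filesystem for a given mount point. It assumes the mount point is the last field of the row, as in the output above; the function name bdf_filesystem is hypothetical.

```shell
# Read bdf output on stdin and print the filesystem mounted on the mount
# point given as the first argument. A line with a single field is treated
# as a wrapped filesystem name that belongs to the following line.
bdf_filesystem() {
    awk -v mp="$1" '
        NF == 0 { next }                    # skip blank lines
        NF == 1 { wrapped = $1; next }      # filesystem name alone on its line
        {
            fs = (wrapped != "") ? wrapped : $1
            wrapped = ""
            if ($NF == mp) print fs
        }
    '
}
```

It could be used as: bdf | bdf_filesystem /usr/local/sap. A result containing a colon (server:/path) indicates that the NFS export is now mounted.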

Finally, you can change directory to the mount point above and list its files without any problem.