3. Goal-oriented Guides
3.1. Detailed Setup for Developers
This document is structured with suggested sequences of steps at the top, and the various processes you need underneath. Use the suggested sequences or create your own as appropriate.
Many of the processes in here have evolved over years of trying different approaches. In the main, our philosophy has been to adopt the least-worst option: minimising user surprise in the fewest, simplest possible steps, even if that results in a more complex implementation for us in code paths where few users ever tread.
We welcome all suggestions and discussions for improvement, but if you’re going to tell us we’re 'using Docker Wrong' then we’d greatly appreciate an explanation of how you’d improve it.
3.1.1. Developers setup on OSX
The Developer onboarding is aimed at those who want to modify the automation tooling itself to change behaviors.
-
Prepare your OSX Deps
-
Follow the Main Setup
-
(Optional) Setup Commit Signing for Contributions
-
(Optional) Get setup to Develop the Collections themselves
3.1.2. Developer setup on Windows
Because Windows already needs to set up Linux compatibility to make containers work, we’re just going to leverage that across the board for simplicity.
Important
|
You must use WSL2 or you are going to get some really weird errors. |
We assume you have a recent version of 64-bit Windows 10 with Administrator access and virtualization enabled. If your org restricts you from doing this, they should have a replacement process, as this is pretty standard stuff for developers.
-
Install WSL2
-
Handle Line Endings on Windows
-
Follow the Main Setup
-
(Optional) Setup Commit Signing for Contributions
-
(Optional) Get setup to Develop the Collections themselves
3.2. Process Guides
3.2.1. Main Setup Process
This Setup Process is simply the Quickstart from the Cloudera-Deploy Readme with more steps and explanations.
Prerequisites
Note
|
It is easiest to test your credentials within the Runner, as the dependencies are already installed for you |
-
Cloud Credentials (CDP Public) and/or Host Infrastructure (CDP Private)
-
If you are deploying to a public cloud, you will need an active credential for that provider with rights to deploy resources for the Deployment you specify.
-
Unless otherwise instructed, your ‘default’ credential will be used for the cloud provider.
-
You can check this with
aws sts get-caller-identity
and comparing the output to the AWS UI, if using AWS -
For Azure, consider
az account list
, or if you need to refresh your credentials, az login
. -
For GCP, check that your Service Account credential is being picked up in the Cloudera Deploy Profile, and then test with
gcloud auth list
-
-
CDP Credentials (CDP Public)
-
If you are deploying CDP Public Cloud, you will need an active credential within a CDP Tenant.
-
Unless otherwise instructed, your ‘default’ credential and tenant will be used.
-
You can check this with
cdp iam get-user
and comparing the output to the CDP UI
-
-
Code Deps:
-
Docker and Git installed locally (quick option) (see Setup Docker)
-
or the ability to install Python and CLI dependencies on the machine of your choice (slow option) (see the Centos7 Script for an example)
-
Get the Repo
-
Open a terminal and switch to your preferred projects directory
cd ~/Projects
-
Clone or update the repo
git clone https://github.com/cloudera-labs/cloudera-deploy.git && cd cloudera-deploy && chmod +x ./quickstart.sh
Note
|
If you already have the repo cloned, don’t forget to update it |
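If you already have a clone from a previous session, a typical refresh (assuming the directory layout above) looks like:
cd ~/Projects/cloudera-deploy && git pull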
Get the Runner
-
Have Docker Installed and running
-
(Optional) Set your provider to download a reduced image and save disk space
-
AWS
export provider=aws
-
Azure
export provider=azure
-
GCP
export provider=gcp
-
The default is all of them, but it’s a ~2.3GB image download
-
-
Run the quickstart script in the Repo you just cloned above
-
Option A: Run the quickstart bare. It will mount the parent directory of the cloudera-deploy repo, which will be your code directory if you followed the directions above, e.g.
~/Projects
./quickstart.sh
-
Option B: Include an explicit path on your local machine; it will be mounted at /runner/project in the container. If your code or Definitions are not under this path you may not be able to access them in the container.
-
Don’t worry if you get it wrong: just stop the container and rerun quickstart, and it’ll make a new one with the new mount.
./quickstart.sh ~/code
-
-
This will drop you into a new shell in the container with all dependencies resolved for working with CDP and Cloud Infrastructure. You’ll know it has worked because the shell prompt is orange and reads
cldr <version> #>
-
It will automatically load your local machine user profile so you have access to your credentials, ssh keys, etc.
-
If you run the quickstart script again, it’ll simply create another bash session in the container, providing useful parallelism
-
If you stop the container, the next time you run quickstart it will be updated and recreated, so any changes within the container filesystem that are not persisted back to your Project directory or mounted user profile will be lost
-
As long as you run commands from within the /runner path, your Ansible runs will be logged back to ~/.config/cloudera-deploy/log
-
-
If you already have a CDP Credential in your local user profile, you can test it with
cdp iam get-user
-
It will use the default CDP credential, or you can use a different profile by setting the CDP_PROFILE environment variable, or setting cdp_profile in your cloudera-deploy Definition files.
-
You should compare the UUID of your user returned by this command in the terminal with the UUID of your user reported in the User Profile in the CDP UI so you are certain that you are deploying to the expected Tenant
-
-
Check you have a credential for your chosen Cloud Infrastructure Provider. The default is AWS, and again you can provide a specific profile or use your default. You can check it by running
aws sts get-caller-identity
-
You should likewise compare the Account ID reported here with the Account ID in the AWS IAM UI to ensure you are targeting the expected Account. This is similar for other providers.
-
Prepare your Profiles to run a Deployment
Note
|
that you should execute any Ansible commands from /runner in the Runner, as it has all the defaults set for you and it may fail to find dependencies otherwise. |
Note
|
If you have different settings for different deployments you can create additional profile files under the directory above to store the different configurations. To use an alternative cloudera-deploy profile, specify the -e profile=<profile_name> option when running the ansible-playbook command.
|
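For example, to run the sandbox Deployment shown later in this guide with a non-default profile, you might use:
ansible-playbook project/cloudera-deploy/main.yml -e "definition_path=examples/sandbox" -e "profile=<profile_name>" -t run,default_cluster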
-
Edit the default user Profile to personalise the Password, Namespace, SSH Keys, etc. for your Deployments. Note that this file is created the first time you run the quickstart. A rough sketch of typical values is shown after this list.
vi ~/.config/cloudera-deploy/profiles/default
-
You will need CDP Credentials, and Credentials for your target Infrastructure of choice. Fortunately, the Runner has the CLI tooling for most of these already installed for you
-
CDP with pre-issued Keys and optional profile
cdp configure --profile default
-
AWS with pre-issued Keys
aws configure --profile default
-
AWS SSO requires awscliv2 which is not installed in the Runner by default
-
-
Azure via interactive login
az login
-
Google Cloud via init
gcloud init
-
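As referenced in the profile editing step above, here is a rough sketch of the kind of values you might personalise. It assumes a YAML profile and shows only keys mentioned elsewhere in this guide; the exact set of keys depends on your cloudera-deploy version, so treat it as illustrative rather than definitive.
# ~/.config/cloudera-deploy/profiles/default (illustrative values only)
admin_password: "MyStr0ngPassword!"   # used as the Cloudera Manager admin password during cluster deploys
name_prefix: "usr"                    # the short Namespace used to identify your Deployments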
Deployment Run Commands
Provided you have completed the prerequisites (a cloud provider credential, a CDP Public Cloud credential, and your Cloudera-Deploy Profile), the following command creates a default Sandbox with CDP Public & Private Cloud in your default CDP Tenant and Cloud Infra Provider, with no further interaction from the user:
ansible-playbook project/cloudera-deploy/main.yml -e "definition_path=examples/sandbox" -t run,default_cluster
Note
|
So that is three dependencies, and two commands, and you have a complete Hybrid Cloud Data Platform |
The command is structured as a typical ansible-playbook invocation.
-
If you have used quickstart.sh to mount a local project directory with your definitions.yml and application.yml into the runner container (as explained in the steps in Get the Runner above), then you won’t have your own main.yml close to hand. You can instead use the default main.yml in /opt/cloudera-deploy/, and reference your project dir at /runner/project:
ansible-playbook /opt/cloudera-deploy/main.yml -e "definition_path=/runner/project/<your definition path>" -t <your tags>
-
Note that we pass in some ansible 'extra variables' using the -e flag. The only required variable is definition_path, which points to the directory containing the files that describe the deployment you want. You can provide many more extra vars, or even var files, directly on the command line per the usual Ansible options.
-
Note that we pass in Ansible Tags with
-t
. In this case the tags instruct Cloudera-Deploy to build CDP Public Cloud to the 'Runtimes' level, and also to deploy a 'default' or basic CDP Base Cluster on EC2 machines in the same VPC. There are many other tags that may be used to control behavior; they are explained elsewhere.
-
Teardown and delete everything related to this definition:
ansible-playbook project/cloudera-deploy/main.yml -e "definition_path=examples/sandbox" -t teardown
Warning
|
This will tear down everything related to this definition and name_prefix; make sure that is actually what you want before running it.
|
-
Just deploy a CDP Public Datalake:
ansible-playbook project/cloudera-deploy/main.yml -e "definition_path=examples/sandbox" -t plat
Note
|
This uses the same definition, but with a different Ansible Tag to deploy only part of it. More explanation of the Definitions and Tags follows elsewhere in the Architecture Docs. |
-
Just deploy CDP Private Cluster Trial on Public Cloud Infra:
ansible-playbook project/cloudera-deploy/main.yml -e "definition_path=examples/sandbox" -t infra,default_cluster
Note
|
This leverages the dynamic inventory feature to build a simple cluster on EC2 instances on AWS without the user first needing to learn how to provision the hosts, and is very handy for trials and platform testing |
3.2.2. Refresh your Cloudera-Deploy local repo
If you have previously used Cloudera-Deploy but haven’t touched it in a while, here is a guide to refreshing your setup and getting results quickly
cd cloudera-deploy
git fetch --all
git pull
docker stop $(docker ps -qa)
./quickstart.sh
cdp iam get-user
aws iam get-user
3.2.3. Manual Ansible Controller setup on Centos7
We provide an example script for initialising an Ansible Controller on a Centos7 box here
3.2.4. Install Homebrew and Git on OSX
xcode-select --install
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install git
If you are going to use AWS SSO, you may also wish to install awscliv2 on your local machine
brew install awscli@2
3.2.5. Setup Docker
Follow the instructions provided by Docker
Note
|
You want to be using WSL2; make sure Docker can see the underlying Ubuntu (or similar) for execution. You are advised to stick to the guide here, as we have found Windows throws some intractable filesystem and networking errors with creative Linux-on-Windows setups for Docker and SSH. |
Follow the instructions provided by Docker
3.2.6. Install Windows Subsystem for Linux (WSL2)
-
Enable Developer mode (on older versions of Win10)
-
Windows Settings > Update & Security
-
Select the Developer Mode radio button
-
You may not need to do this; don’t worry if it’s not there and the rest of the process works, as the latest releases don’t require it in order to install Linux
-
If you did have to enable it, you may have to reboot (yay Windows)
-
-
Enable Windows Subsystem for Linux v2
-
Control Panel > Programs & Features > Turn Windows Features on and off
-
Tick the box for ‘Windows Subsystem for Linux’
-
Make sure you either set up WSL2 from the start, or do the upgrade process. WSL1 has some strange behaviors
-
You’ll probably have to reboot. Yay Windows!
-
-
Install Ubuntu 18 (other distros untested, including Ubuntu 20)
-
Try to do it from the Microsoft Store
-
Open the store (search store in launch bar)
-
Search for ‘Linux’ in the store
-
Select and install Ubuntu
-
-
If you can’t do it from the store, try this in the cmd.exe prompt
lxrun /install
-
It’ll ask you to set a username and password; keep it short and simple, it doesn’t have to match your actual Windows user
-
-
You may have to set this as the default bash environment as Docker for Windows likes to steal it away after updates
-
List the currently installed distros by opening cmd.exe
-
wsl --list --all
-
Set ubuntu as default if it is not
-
wsl -s Ubuntu
-
3.2.7. Handle line-endings on Windows
Windows defaults to a different line ending standard than Linux/OSX.
Note
|
You only need to follow this step if you plan on editing the code on Windows. |
The Cloudera Labs repos standardise on Linux line endings; the easiest way to handle this on Windows involves two steps.
-
Set git to never use Windows line endings, so it doesn’t rewrite your files when you checkout (see also the .gitattributes sketch after this list)
-
git config --global core.autocrlf false
-
Set your IDE to use Linux line endings, if not for everything then at least the Cloudera Labs projects
-
In PyCharm, this is in File > Settings > Editor > Code Style > General > Line Separator: set to ‘Unix and macOS (\n)’
-
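As a complement to the global git setting above, you can also enforce LF endings at the repository level with a .gitattributes file. This is a minimal sketch, not something the Cloudera Labs repos require, so add it at your own preference:
* text=auto eol=lf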
3.2.8. Setup GPG commit signing
Note
|
You can skip this step if you are not planning on contributing code |
Cloudera-Labs uses the Developer Certificate of Origin (DCO) approach to open source contribution: by signing your commits you are attesting that you are allowed to submit them by whoever owns your work (you, or your employer). We also require commit signing to validate the supply chain of our code.
There is a good explanation here; it also covers setup for machines with other operating systems.
This subguide assumes that you want to set up your development machine so that it automatically signs all your commits without bothering you too much. If you are just checking out the code to inspect it and not for contributing back to the community, then you can skip this step. If you prefer, you can modify this process to prompt for your passphrase or to sign commits manually each time.
https://nifi.apache.org/gpg.html
https://stackoverflow.com/a/46884134
https://superuser.com/a/954536
https://withblue.ink/2020/05/17/how-and-why-to-sign-git-commits.html
OSX Catalina 10.15.7 on MacBook Pro 2019
-
Update or install Homebrew
-
Install dependencies
brew install gpg2 pinentry-mac && brew install --cask gpg-suite
-
gpg-suite is not strictly necessary, but it makes it easier to integrate signing with IDEs like IntelliJ as it helps you manage the passphrase in your OSX keychain
-
Pinentry-mac makes it easy to sign commits within your IDE, like Pycharm, without having to always commit via terminal
-
-
Create a directory for gnupg to store details
mkdir ~/.gnupg
-
Put the following in
~/.gnupg/gpg-agent.conf
default-cache-ttl 600
max-cache-ttl 7200
default-cache-ttl-ssh 600
max-cache-ttl-ssh 7200
pinentry-program /usr/local/bin/pinentry-mac
-
Enable it in your user profile such as ~/.bash_profile or ~/.zprofile
export GPG_TTY=$(tty)
gpgconf --launch gpg-agent
-
Set correct permissions on your gnupg user directory
chown -R $(whoami) ~/.gnupg/
find ~/.gnupg -type f -exec chmod 600 {} \;
find ~/.gnupg -type d -exec chmod 700 {} \;
-
Generate yourself a key
gpg --full-gen-key
-
Key type 4
-
Keysize 4096
-
Expiration 1y or 2y or whatever
-
Your real name, or github username
-
Your real email address, or the one github recognises as yours in your Settings
-
A Passphrase - don’t forget it, and make sure it is strong
-
-
Verify your key is created and stored
gpg2 --list-secret-keys --keyid-format SHORT
-
Copy your key ID, it’ll look something like
rsa4096/674CB45A
-
You want the second bit,
674CB45A
-
-
Test your key can be used
echo "hello world" | gpg2 --clearsign
-
You’ll have to enter your passphrase to sign it, then it’ll print the signed message
-
-
You may want to add multiple email addresses to the key signing, such as your open source email and/or your employer email and/or your personal email
-
Open your key for editing
gpg2 --edit-key <your ID here>
-
Then use the adduid command
adduid
-
Enter the identity Name and Email as before
-
Then update the trust for the new identity
uid 2
trust
-
You probably want trust 5
-
-
Save to exit
save
-
-
Configure Github to recognise your signed commits
-
Set your git email for making commits
git config --global user.email <your@email.com>
-
This email must be one of those in your GPG key set earlier
-
-
Export your public key for uploading to Github
gpg2 --armor --export <your ID here>
-
Copy everything including the following lines into your paste buffer
-----BEGIN PGP PUBLIC KEY BLOCK----- … -----END PGP PUBLIC KEY BLOCK-----
-
-
Open github and go to your key settings https://github.com/settings/keys
-
Add new GPG key, paste in your PGP block from your buffer
-
Configure git to autosign your commits
git config --global gpg.program gpg
git config --global user.signingkey <your ID here>
git config --global commit.gpgSign true
git config --global tag.gpgSign true
-
Put the following in ~/.gnupg/gpg.conf
# Enable no-tty for IDE signing
no-tty
# Enable gpg to use the gpg-agent
use-agent
-
Configure IntelliJ to sign your commits
-
This should be as simple as restarting your IDE once this process is complete
-
Then when you make a new commit you can tick the box to ‘Sign-off commit’ in the dialog box
-
-
Configure vsCode to sign commits
-
Find the following flag in the config and enable it
git.enableCommitSigning
-
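Once the above is in place, you can confirm end to end that signing is working by making a commit in any local repository and inspecting it:
git log --show-signature -1
A 'Good signature' line in the output confirms the commit was signed with your key.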
3.2.9. Using the Ansible Runner Independent of Cloudera-Deploy
In order to minimise time spent on dependency management and troubleshooting issues arising from users on different systems, we provide a standardised container image. The image is prepared in such a way that you can use it as a shell, a python environment, a container on Kubernetes, within other CICD Frameworks, Ansible Tower, or simply as an ansible-runner.
-
OSX Catalina 10.15.7 on MacBook Pro 2019
-
Windows 10.0.19042 on Intel gaming rig (Tasteful RGB Edition)
-
To run this process on Windows you are expected to be within your WSL2 subsystem
-
Clone the Cloudera Labs Ansible Runner implementation repo into your preferred local Projects directory
git clone https://github.com/cloudera-labs/cldr-runner.git && cd cldr-runner
-
Linux only: Mark the run_project.sh script and build.sh script as executable
chmod +x ./run_project.sh
chmod +x ./build.sh
-
Ensure Docker is running on your host machine
-
Copy the absolute path to the root of your code projects directory that contains the projects you want to execute within the container environment, e.g.
/Users/dchaffelson/Projects
-
Launch the runner targeting the project you want to execute by passing the absolute path as the argument to the run_project.sh script, e.g.
./run_project.sh /Users/dchaffelson/Projects
-
The script will build the container image from the latest release bits; this will take a few minutes the first time, and the resulting image will be ~2GB
-
You will then be dropped into a shell session in directory /runner in the container environment. Your Project will be mounted at /runner/project. You will have all the currently known dependencies for working with CDP pre-installed with conflicts resolved
-
Note that the container must be stopped before a new project directory can be mounted into a new build; if a container of the same name is already running you will just get a new shell session in it
-
-
At this point you may wish to install additional dependencies to the container, particularly those which may be behind VPN or on your corporate VCS.
ansible-galaxy install -r project/deps/ansible-deps.yml
pip install -r project/deps/python-deps.txt
Note
|
By default, the container is recreated if stopped, but it will not stop if you close your shell session as it is held open by a background tty. Try not to kill that. |
3.2.10. Getting Started with Developing Collections
Note
|
You can skip this step if you only want to use the Collections to create your own playbooks. This step sets you up to develop the Collections themselves. |
This will guide you through setting up a directory structure convenient for developing and executing the Collections within the Runner, or other execution environments.
You only need to do this if you want to contribute directly to the Collections or Python clients underlying the interactions with Cloudera products - you do not need to go through this setup process if you simply wish to use cloudera-deploy with your own YAML Definitions, as the Collections and Clients should not need to be modified in those cases and are already pre-installed in the Runner.
Ansible expects to find collections within a path ‘collections/ansible_collections’ on a series of predefined or default paths within your environment. By default, the Runner has this path variable prepopulated to point at the pre-installed Collections; this process guides you through modifying it to point at your own versions, which you then maintain yourself.
For development purposes, creating this path in your favourite coding Projects directory, then checking out the collections under it and renaming them to match the expected namespace, may seem slightly arcane, but it is the lowest-friction method for ongoing development that we have found over many years of doing this.
-
Make the directory tree Ansible expects in the same parent code directory that cloudera-deploy is in, e.g.
cd cloudera-deploy && mkdir -p ../ansible_dev/collections/ansible_collections/cloudera
-
cloudera is the base namespace of our collection
-
Your Projects directory should also have your Ansible Playbooks and other codebase in it, so that you can mount the root of it to the Runner and have access to all your codebase, e.g.
~/Projects/cloudera-deploy
should be where Cloudera-Deploy is located
-
-
Fork each of the sub-collections and cdpy into your personal github, and replace <myAccount> with your actual github account below
-
Checkout each of the sub-collections into this folder, e.g.:
cd ~/Projects
git clone -b devel https://github.com/<myAccount>/cdpy.git cdpy
cd ansible_dev/collections/ansible_collections/cloudera
git clone -b devel https://github.com/<myAccount>/cloudera.exe.git exe
git clone -b devel https://github.com/<myAccount>/cloudera.cloud.git cloud
git clone -b devel https://github.com/<myAccount>/cloudera.cluster.git cluster
Note
|
The cloned directories above must be named "exe", "cloud" and "cluster", respectively. Ensure you specify the directory name as the last parameter in the command line, as shown above. Each of the sub-collections should be on the ‘devel’ branch so you can PR your changes back. |
-
Your Code Project directory should now look something like this:
/Projects/ansible_dev/collections/ansible_collections/cloudera
/Projects/ansible_dev/collections/ansible_collections/cloudera/exe
/Projects/ansible_dev/collections/ansible_collections/cloudera/cloud
/Projects/ansible_dev/collections/ansible_collections/cloudera/cluster
/Projects/cdpy
/Projects/cloudera-deploy
-
Before you invoke quickstart.sh, set the environment variables below to tell Ansible where to find your code inside your execution environment once it is mounted in the Container at /runner/project:
export CLDR_COLLECTION_PATH="ansible_dev/collections"
export CLDR_PYTHON_PATH=/runner/project/cdpy/src
Note
|
You might want to set this in your bash or zsh profile on your local machine so it is persistent |
-
Then, when you run quickstart.sh in cloudera-deploy, it will pick up this extra Collection location and add cdpy to the PYTHONPATH, and use these instead of the release versions baked into the Container
-
You can confirm this is working by running this inside the Runner
ansible-galaxy collection list
It should look something like:
# /runner/project/ansible_dev/collections/ansible_collections
Collection Version
---------------- -------
cloudera.cloud 0.1.0
cloudera.cluster 2.0.0
cloudera.exe 0.0.1
Cloudera.runtime 0.0.1
# /home/runner/.ansible/collections/ansible_collections
Collection Version
-------------------- -------
amazon.aws 1.4.0
ansible.netcommon 1.5.0
ansible.posix 1.1.1
azure.azcollection 1.4.0
community.aws 1.4.0
community.crypto 1.4.0
community.general 2.1.1
community.mysql 1.2.0
community.postgresql 1.1.1
google.cloud 1.0.2
netapp.azure 21.3.0
If you see duplication of collections because you are using the runner AND mounting your own versions, you probably have not activated the CLDR_COLLECTION_PATH variable correctly, and thus quickstart.sh is not picking it up.
As another test, you should also be able to invoke python inside the container and use cdpy
from cdpy.cdpy import Cdpy
c = Cdpy()
c.iam.get_user()
This tests that cdpy is present and that you can access your account, just as on the command line.
You may now edit the collections or cdpy and have the changes immediately available for use within the Runner, which is an awful lot easier than having to compile and crossload them after every change.
3.2.11. Install Dependencies without using Runner
We prefer that you use the Runner, because it sets many defaults to avoid common issues and thus saves you and us a lot of issue reproduction time. However, we understand that there are many situations where it may not be appropriate, such as air-gapped environments, or when you want to run the install locally on the hardware and not have a separate ansible controller.
-
Create a new virtualenv or activate an existing one (a sketch follows the note below); we do not recommend installing dependencies in the system Python on most operating systems.
-
Install dependencies for your hosting infrastructure version following the pathway laid out in the Dockerfile in ansible-runner. Install any additional dependencies you may have.
Note
|
The Dockerfile resolves combined dependencies for all our Hybrid Cloud Deployments; you probably only need a subset for your environment. |
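As a sketch of the first step above, using standard Python tooling (the path is illustrative):
python3 -m venv ~/venvs/cloudera-deploy
source ~/venvs/cloudera-deploy/bin/activate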
3.2.12. Developing within the Runner
While we recommend using the Runner as your execution environment when doing development, actually developing directly against the Runner using something like Visual Studio may not be a great idea due to the file system latency commonly encountered.
Generally when the maintainers work with this system we are editing the files directly on our host system using our IDE, and those files are RW mounted into the container for execution via the /runner/project
mechanism, which does not incur any noticeable performance degradation.
3.2.13. Working with AWS SSO
Traditionally, AWS users used static keys, but more recently temporary credentials via SSO have become commonplace.
The upside of AWS SSO is better credential management, the downside being additional complexity, reliance on the fairly awful AWSCLIv2, and a lack of OOTB automation integration.
-
Install awscli v2 on your local machine
brew install awscli
-
Check you have version 2 (currently 2.5.4)
aws --version
-
Log in to AWS SSO via Okta (or your identity provider of choice), examine the list of accounts and select the one you want to set up, then copy the name to use as your AWS Profile name
-
Setup and login to AWS SSO for your selected account
aws configure sso --profile <selected name here>
-
Enter your Start URL, e.g.
https://<app-name>.awsapps.com/start
-
SSO Region, e.g.
us-east-1
-
It’ll launch your browser (which is why we do it on your local machine)
-
Complete the login process to authenticate the session
-
Select which AWS account you wish to set up
-
Set your default region for this profile, e.g.
us-east-1
-
Set your default output for this profile, e.g.
json
-
-
Now whenever you run a deployment you must set the profile name to match the one you have set up, e.g. in definition.yml
aws_profile: my-aws-profile
Note
|
You will likely have to re-login to SSO before each run |
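When the temporary credentials expire, you can refresh the session without repeating the full configuration (requires awscli v2):
aws sso login --profile <selected name here>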
3.2.14. Working with named credential profiles
For CDP, AWS, and other credential-based access controls it is common to have multiple accounts that may be used from time to time. As such, it is useful to be able to easily, safely, and precisely switch between these accounts.
This is typically achieved through use of multiple named profiles, a good introductory guide to this is here for AWS.
To set up a named profile for CDP, follow this guide to create an API access key, and then this guide to configure your local credentials by simply adding the --profile <profile name>
flag to the commands
Credentials are typically stored in a dot-prefix directory in the user profile on Linux, and in the user home directory on Windows, e.g. ~/.aws/credentials
, ~/.cdp/credentials
etc. They may be directly edited in these files if you prefer instead of using the configure
CLI commands.
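For illustration, a pair of named profiles in ~/.aws/credentials might look like the following, with placeholder values; ~/.cdp/credentials follows the same INI layout, using the key names issued when you generate a CDP API access key.
[default]
aws_access_key_id = <access key id>
aws_secret_access_key = <secret access key>
[dev]
aws_access_key_id = <access key id>
aws_secret_access_key = <secret access key>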
By default, Cloudera-Deploy will use the default
profiles for each account touched; you can set the Definition yaml keys cdp_profile: <profile name>
and/or aws_profile: <profile name>
to use a specific named profile instead.
-
There should always be a
default
credential which points to a 'safe' account for when you deliberately or accidentally run commands without specifying a profile. This is generally a dev account, or even an expired credential, to ensure you don’t accidentally delete prod. -
It is typical with CDP Public Cloud to pair a CDP Tenant with specific Cloud Infrastructure accounts to avoid confusion; therefore, if you have a
dev
and prod
CDP tenant with profiles of the same name, we recommend the paired cloud infrastructure accounts are also named dev
and prod
.
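Following that pairing recommendation, a Definition targeting the dev accounts would then simply set both keys to the same name:
cdp_profile: dev
aws_profile: dev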
3.2.15. Using a Jumpbox
A common deployment scenario is to funnel all cluster access through a jump/bastion host.
In this case, there are three possibilities:
-
To run the Ansible Runner from the jump host itself
-
To deploy the dependencies within the boundary of the Jump Host
-
To run the Ansible Runner locally and tunnel connections through the jump host.
In scenario 3, the following will be necessary to tunnel both SSH connections and HTTP calls through the jump host.
In the runner, edit /runner/env/envvars
and add http_proxy=<proxy>
where <proxy> is the name:port of your HTTP proxy (e.g. a SOCKS proxy running on localhost).
Alternatively, edit quickstart.sh to pass this value through from your local machine if it is available.
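As a sketch of scenario 3, assuming port 1080 and the SOCKS approach mentioned above, you could open the tunnel from your local machine with:
ssh -D 1080 -q -N -f <jump host>
and then set http_proxy=localhost:1080 in /runner/env/envvars as described.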
In your inventory file, under the [deployment:vars]
group, add the following variable to set additional arguments on the SSH command used by Ansible.
ansible_ssh_common_args='-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ProxyCommand="ssh -W %h:%p -q <jump host>"'
Optionally, in your SSH config file (e.g. ~/.ssh/config) you can configure an alias with predefined parameters for the jump host. This makes it easier to manage between different deployments and makes the argument string easier to read.
Host myJump
IdentityFile ~/.ssh/myKey
StrictHostKeyChecking = no
User myUser
HostName jump.host.name
With this SSH config the proxy string would look like this:
ansible_ssh_common_args='-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ProxyCommand="ssh -W %h:%p -q myJump"'
3.3. CDP Private Cloud Deployments
CDP Base represents deployment of 'Traditional' clusters via Cloudera Manager, which you might think of historically as hadoop or big data clusters. In practice, there is an Ansible Collection called cloudera.cluster
which is excellent at configuring Cloudera Manager across various versions of Linux regardless of whether that OS is running in Containers, Virtual Machines, Cloud Infra like EC2, or old school baremetal nodes.
Cloudera.cluster is wrapped within Cloudera-Deploy, so again in practice you are most likely to simply use Cloudera-Deploy as the entrypoint to drive cloudera.cluster
to create the cluster you want via the Definition mechanism explained in these docs.
The difference is that cloudera.cluster has been built over many generations to support Cloudera Manager versions 5 to 7, and supports many legacy options and behaviors for user convenience. The result is that, while simple deployments are as simple as the more modern CDP Public Cloud, complex deployments will require a greater familiarity with Ansible and Cloudera Manager templates.
3.3.1. Ansible Controller options
We still strongly recommend that you use the cloudera-runner Docker Container shipped with Cloudera-Deploy as your Ansible controller for CDP Base deployments, because the consistency and logging are going to save you time and effort.
However, it is not uncommon to use the target Cloudera Manager node as the Controller. It is also not uncommon to need to deploy to air-gapped environments or environments where Docker is not permitted.
As such, there is a script for deploying the dependencies for Cloudera-Deploy on centos7 as an example in the repo for Power Users in these scenarios. Note that the script covers dependencies for all Cloudera Products on all current Clouds, so you may wish to exclude parts irrelevant to your scenario.
Note
|
It is worth noting that the admin_password field from your Profile, as outlined in the Quickstart, will be used as the Cloudera Manager admin password during cloudera.cluster deployments.
|
3.3.2. Inventories
An inventory file can be supplied using the traditional -i flag at runtime or by including either inventory_static.ini or inventory_template.ini in the definitions path.
The Ansible inventory covers the hosts in scope for the deployment. An inventory with one of these names will be picked up automatically if present, without using the -i
to specify the inventory file at runtime.
Common inventory group names are:
-
cloudera_manager
-
cluster_master_nodes
-
cluster_worker_nodes
-
cluster_edge_nodes
Each host in these groups should be assigned one of the host templates that is defined in cluster.yml. These can be applied per host by setting host_template=TemplateName
after the hostname, or per group by setting host_template=TemplateName
in the group vars. See here for more information on Ansible inventory vars.
Nodes that should be added to the CDP Cluster as defined in the cluster.yml must be added as children to the cluster
group. For example, if all data nodes are present under the cluster_worker_nodes
group, then this group should be listed under cluster:children
. See here for an example of setting group children.
All nodes/groups should be added as children of the deployment
group [deployment:children]
. This includes the cluster
group, as well as all non-CDP Cluster groups, for example, Cloudera Manager and supporting infrastructure nodes, perhaps running HAProxy or an RDBMS, that are deployed with the cluster but are not running CDP Roles. This allows for deployment-wide variables to be set, such as the common ssh key that can be used to access all nodes.
Warning
|
Do not use the [all:vars] group if you are using an independent Ansible Controller, as Plays will also be applied to it via the localhost mechanism. This includes running from your laptop or the Ansible-Runner supplied in Cloudera-Deploy.
|
3.3.3. Process Overview
In this section we will walk through a typical deployment, commenting on options at various points
1. Prepare a Definition Path
You are advised to create a working directory for your definition in your preferred Projects directory.
If you have checked out Cloudera-Deploy from Github following the Quickstart, then you could make a copy of the folder cloudera-deploy/examples/sandbox
into the same base code Projects directory, and use it as a starting point.
~/Projects/cloudera-deploy # contains quickstart.sh, main.yml
~/Projects/my-definition # copied from ~/Projects/cloudera-deploy/examples/sandbox
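A minimal way to make that copy, assuming the layout above:
cp -r ~/Projects/cloudera-deploy/examples/sandbox ~/Projects/my-definition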
2. Infrastructure and Inventory
You are advised to prepare your nodes in advance by whatever mechanism you prefer, and then include them in an Ansible Inventory matching the shape of the Cluster you want to deploy.
You may use Cloudera-Deploy to automatically provision EC2 instances based on the inventory template, but this is intended for test and training purposes. An example inventory template is included in the Cloudera-Deploy Sandbox Example. There is also a copy of the inventory_static.ini
file in this directory.
If you want to use a static inventory, you may use any of the typical Ansible inventory approaches such as -i
on the cmdline, the Inventory path, or including a file inventory_static.ini
in your Definition Path which Cloudera-Deploy will automatically pick up.
You may wish to include additional variables in the Inventory, most typically the ssh user and private key. These should probably be in the [deployment:vars]
section.
[deployment:vars]
ansible_ssh_private_key_file=~/.ssh/root_key
ansible_user=root
Note
|
When you run quickstart.sh to launch the Runner Container, it will mount your local user ~/.ssh directory into the container, so there is no need to copy keys around provided you use bash expansion for the user home directory.
|
3. Cloudera-Deploy and the Runner
If you have not already done so, you should follow the instructions to setup your OS dependencies and get a copy of Cloudera-Deploy. New users are encouraged to use the Quickstart, while Power Users who want to immediately understand the inner workings may wish to follow selected parts of the Developers guide.
Regardless of which method you select, complete the setup process until you are in the orange coloured interactive shell. Ensure that /runner/project within the shell has mounted your Projects directory that contains the Definition working directory you created in step one.
Important
|
At this point you should ensure that the directory listing in /runner/project within the Runner matches your projects directory on your local machine; we earlier gave the example of ~/Projects for this.
|
When using the Ansible Runner, you should be aware that most changes to the filesystem within the container are not persistent - that is, when you kill the container, most changes will be lost. When we are working on files that we want to persist, such as cluster templates, we can keep these on the local filesystem outside of the container, and mount the local filesystem directory inside the container, which is the purpose of the /runner/project mountpoint. This means that we can read and write to those files both inside and outside of the container and you can safely terminate your container without losing them.
Apart from the working dir, the container will mount specific directories out of your local user profile such as ~/.ssh, ~/.config, ~/.aws etc. into the container to pass-through credentials and other requirements. You should assume that any changes outside of these profile directories and /runner/project are not persisted.
Note
|
If we do not specify which directory to mount, the quickstart.sh script will mount the parent directory of the current session, i.e. if you run quickstart.sh from the path ~/Projects/cloudera-deploy , then ~/Projects in the host filesystem will be mounted to /runner/project in the Container.
|
Note
|
If you did not stop the container previously, quickstart.sh will ignore any arguments passed and simply give you a new terminal session in the existing container. Therefore, if you want to change the mounted directory you must stop and restart the container with the new path. |
3.3.4. Executing Playbooks
To run the playbook, we use the ansible-playbook
command. The entry point to the playbook is in main.yml in cloudera-deploy. With our mount point in /runner/project, the full path is /runner/project/cloudera-deploy/main.yml
. This is passed as the first argument to ansible-playbook.
We also need to provide some additional arguments. We do this with the -e flag. The first argument to pass is the definition_path. This is the path to the directory that contains our definition, inventory, etc.
Note
|
It is usually a good idea to use Cloudera-Deploy to verify your infrastructure before committing to the full deployment, by using the same main playbook and adding the ‘verify’ tag, as below. This is particularly handy for real-world deployments for a quick sanity check. |
An example command is shown below. Notice that we do not need to specifically provide an inventory list. The playbook will look in the definition_path for an inventory file, which is included in the cloudera-deploy-definitions examples. Of course, you can provide an inventory file using the -i flag if you want.
Note
|
Commands like ansible-playbook should be run from the /runner directory in the Ansible-Runner to pick up the defaults and other useful configurations
|
ansible-playbook /runner/project/cloudera-deploy/main.yml \
-e "definition_path=/runner/project/my-definition/" \
-t verify \
-v
Following this there will be a lot of output in the terminal, which tracks the stages of our deployment.
PLAY [Init Cloudera-Deploy Run] ******************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ***************************************************************************************************************************************************************************************************************************************
Thursday 13 May 2021 18:54:13 +0000 (0:00:00.014) 0:00:00.014 **********
ok: [localhost]
…...
To run the actual full deployment against your inventory, the most common tag to use is full_cluster
. The complete listing of all tags can be found by reviewing the Plays in cluster.yml in cloudera-deploy.
ansible-playbook /runner/project/cloudera-deploy/main.yml \
-e "definition_path=/runner/project/my-definition/" \
-t full_cluster \
-vv
Expect the deployment to take 30 to 90 minutes or longer, assuming you encounter no errors.
If you do run into issues, most runs are idempotent so you can re-run with increased verbosity of the terminal output by adding the ‘-v’ flag to the ansible-playbook command. You can scale the verbosity by adding more v’s up to ‘-vvvvv’ for maximum verbosity.
There is nothing you need to do until the playbook completes, however it can be useful to have a scroll through the output and get a feel for what it is doing.
Eventually, you will get to some output that looks like the following. This indicates that Cloudera Manager is being installed, and then a check runs to wait for the server to start responding. When you get past this step, you’ll be able to access the CM UI in your browser.
It will be installed on the host under the cloudera_manager group in your inventory, listening on port 7180 for HTTP and 7183 for HTTPS. The username is typically 'admin', and the password will be the admin_password you set in your Profile or Definition.
TASK [cloudera.cluster.server : Install Cloudera Manager Server] *********************************************************************************************************************************************************************************************
Thursday 13 May 2021 19:41:50 +0000 (0:00:06.555) 0:47:36.812 **********
….
RUNNING HANDLER [cloudera.cluster.common : wait cloudera-scm-server] *****************************************************************************************************************************************************************************************
Thursday 13 May 2021 19:42:45 +0000 (0:00:09.338) 0:48:31.682 **********
The next important step to watch out for comes right at the end of the playbook. This is the Import Cluster Template step. In this step, the playbook is using the CM API to insert our cluster template, which allows CM to handle the complex logic of deploying the software, pushing out configurations and completing initializations and first runs. During this step, you will not see much useful output in the terminal.
Instead, you should go inside the CM web UI and go to the ‘Running Commands’ page, where you will be able to drill down into the ‘Import Cluster Template’ command and watch the individual steps that CM performs. This is the best place to debug any errors that you might encounter during the Import Cluster Template step.
TASK [cloudera.cluster.cluster : Import cluster template]
******************************************************
Note
|
Deploying parcels can take some time if downloading directly from Cloudera Repos over slow or long-distance connections. Consider using the local repo options if doing multiple builds. |
After the Template is imported, the First Run is completed, and then a cluster Restart command will run.
In the terminal session, our playbook has now completed and we will see the results at the end of the output. We should see a success message and a quick recap of the steps it took.
TASK [Deployment results] ***************************************************************************************************************
Thursday 13 May 2021 20:59:45 +0000 (0:00:00.287) 2:05:31.793 **********
ok: [localhost] => {
"msg": "Success!"
}
PLAY RECAP ***************************************************************************************************************
ccycloud-1.cddemo.root.hwx.site : ok=162 changed=49 unreachable=0 failed=0 skipped=151 rescued=0 ignored=0
ccycloud-2.cddemo.root.hwx.site : ok=71 changed=23 unreachable=0 failed=0 skipped=65 rescued=0 ignored=0
ccycloud-3.cddemo.root.hwx.site : ok=71 changed=23 unreachable=0 failed=0 skipped=65 rescued=0 ignored=0
ccycloud-4.cddemo.root.hwx.site : ok=71 changed=23 unreachable=0 failed=0 skipped=65 rescued=0 ignored=0
localhost : ok=173 changed=11 unreachable=0 failed=0 skipped=149 rescued=0 ignored=1
Thursday 13 May 2021 20:59:45 +0000 (0:00:00.064) 2:05:31.857 **********
===============================================================================
cloudera.cluster.cluster : Import cluster template --------------------------------- 4132.25s
cloudera.cluster.daemons : Install Cloudera Manager daemons package ---------------- 1415.92s
cloudera.cluster.user_accounts : Create local user accounts ------------------------ 294.11s
cloudera.cluster.user_accounts : Set home directory permissions -------------------- 254.82s
cloudera.cluster.common : wait cloudera-scm-server --------------------------------- 99.33s
cloudera.cluster.agent : Install Cloudera Manager agent packages ------------------- 64.32s
cloudera.cluster.os : Populate service facts --------------------------------------- 60.19s
cloudera.cluster.jdk : Install JDK ------------------------------------------------- 50.16s
cloudera.cluster.krb5_server : Install KRB5 server --------------------------------- 39.23s
geerlingguy.postgresql : Ensure PostgreSQL packages are installed. ----------------- 38.81s
cloudera.cluster.cluster : Restart Cloudera Management Service --------------------- 35.66s
cloudera.cluster.mgmt : Start Cloudera Management Service -------------------------- 34.83s
cloudera.cluster.krb5_client : Install KRB5 client libraries ----------------------- 34.25s
cloudera.cluster.kerberos : Import KDC admin credentials --------------------------- 25.34s
Gather facts from connected inventory ---------------------------------------------- 20.92s
cloudera.cluster.krb5_server : Start Kerberos KDC ---------------------------------- 19.53s
cloudera.cluster.deployment/repometa : Download parcel manifest information -------- 19.00s
cloudera.cluster.os : Install rngd ------------------------------------------------- 18.34s
cloudera.cluster.rdbms : Copy SQL to change template to UTF-8 ---------------------- 16.64s
Gathering Facts -------------------------------------------------------------------- 16.31s
cldr base-v1.0.2 #>
From this, we can see that the build took 2:05:31.793 (2 hours 5 minutes) in total; around 1 hour of this was the Import Cluster Template, which includes the parcel downloads. Pre-downloading and hosting a cluster-local parcel repository can speed this up dramatically.
3.4. Contributing to these Docs
Contributions to improve the docs are welcomed at any time.
We encourage you to either directly create a PR for the changes, or open an Issue on Github.
- In order to contribute your changes, please consider the following process
-
-
Fork the repo to your Github account, and create a branch for your changes
-
Make your edits in your checked out copy
-
Install AsciiDoctor locally to build and check your changes
-
(Optional) Install Make to use the Makefile for convenient builds
-
Build the index.html root using
make
in the repository root -
Review the docs by opening index.html in your browser
-
Create a PR to
main
with your changes when they are ready for publication
-
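As a sketch of the local build loop, assuming a Ruby environment is available for AsciiDoctor:
gem install asciidoctor
make
Then open the generated index.html in your browser to review your changes.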