Planet RDO

March 19, 2018

Giulio Fidente

Ceph integration topics at OpenStack PTG

I wanted to share a short summary of the discussions that happened around the Ceph integration (in TripleO) at the OpenStack PTG.

ceph-{container,ansible} branching

Together with John Fulton and Guillaume Abrioux (and, after the PTG, Sebastien Han) we put some thought into how to make the Ceph container images and ceph-ansible releases fit the OpenStack model better; the container images and ceph-ansible are in fact loosely coupled (not all versions of the container images work with all versions of ceph-ansible), and we wanted to move from a "rolling release" to a "point release" approach, mainly to permit regular maintenance of the previous versions known to work with the previous OpenStack versions. The plan goes more or less as follows:

  • ceph-{container,ansible} should be released together with the regular ceph updates
  • ceph-container will start using tags and stable branches like ceph-ansible does

The changes for the ceph/daemon docker images are visible already:

Multiple Ceph clusters

In an attempt to better support the "edge computing" use case, we discussed adding support for the deployment of multiple Ceph clusters in the overcloud.

Together with John Fulton and Steven Hardy (and, after the PTG, Gregory Charot) we realized this could be done using multiple stacks, and by doing so hopefully simplify management of the "cells" and avoid potential issues due to orchestration of large clusters.

Much of this will build on Shardy's blueprint to split the control plane, see spec at:

The multiple Ceph clusters specifics will be tracked via another blueprint:

ceph-ansible testing with TripleO

We had a very good chat with John Fulton, Guillaume Abrioux, Wesley Hayutin and Javier Pena on how to get new pull requests for ceph-ansible tested with TripleO; basically, trigger an existing TripleO scenario on changes proposed to ceph-ansible.

Given that ceph-ansible is hosted on GitHub, Wesley and Javier suggested this should be possible with Zuul v3 and volunteered to help; some of the complications are about building an RPM from uncommitted changes for testing.

Move ceph-ansible triggering from workflow_tasks to external_deploy_tasks

This is a requirement for the Rocky release; we want to migrate away from using workflow_tasks and use external_deploy_tasks instead, to integrate into the "config-download" mechanism.

This work is tracked via a blueprint and we have a WIP submission on review:

We're also working with Sofer Athlan-Guyot on the enablement of Ceph in the upgrade CI jobs and with Tom Barron on scenario004 to deploy Manila with Ganesha (and CephFS) instead of the CephFS native backend.

Hopefully I didn't forget much; to stay updated on the progress join #tripleo on freenode or check our integration squad status at:

by Giulio Fidente at March 19, 2018 02:32 AM

March 16, 2018

Adam Young

Generating a list of URL patterns for OpenStack services.

Last year at the Boston OpenStack Summit, I presented an idea for using URL patterns to enforce RBAC. While this idea is on hold for the time being, a related approach building on top of application credentials is moving forward. In this approach, the set of acceptable URLs is added to the role, so it is an additional check. This is a lower-barrier-to-entry approach.

One thing I requested on the specification was to use the same mechanism as I had put forth on the RBAC in Middleware spec: the URL pattern. The set of acceptable URL patterns will be specified by an operator.

The user selects the URL pattern they want to add as a “white-list” to their application credential. A user could further specify a dictionary to fill in the segments of that URL pattern, to get a delegation down to an individual resource.
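As a minimal sketch of how such a white-list check might work, each URL pattern can be turned into a regular expression, with an optional dictionary pinning segments to specific values (the function names and pattern-to-regex approach here are my own illustration, not the proposed implementation):

```python
import re

def pattern_to_regex(pattern):
    """Convert a URL pattern like /v3/users/{user_id} into a regex,
    treating each {segment} as a single path component."""
    escaped = re.escape(pattern)
    # Handle both escaped and unescaped braces, since re.escape's
    # behavior has varied across Python versions.
    regex = re.sub(r'\\?\{[^}/]+\\?\}', r'[^/]+', escaped)
    return re.compile('^' + regex + '$')

def url_matches(pattern, url, fixed=None):
    """Check a URL against a whitelisted pattern; 'fixed' optionally pins
    named segments to specific values (delegation to one resource)."""
    if fixed:
        for name, value in fixed.items():
            pattern = pattern.replace('{%s}' % name, value)
    return bool(pattern_to_regex(pattern).match(url))
```

With this, whitelisting /servers/{server_id}/action together with {'server_id': 'abcd1234'} would delegate the action API for exactly one server.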

I wanted to see how easy it would be to generate a list of URL patterns. It turns out that, for the projects that are using the oslo-policy-in-code approach, it is pretty easy:

cd /opt/stack/nova
 . .tox/py35/bin/activate
(py35) [ayoung@ayoung541 nova]$ oslopolicy-sample-generator  --namespace nova | egrep "POST|GET|DELETE|PUT" | sed 's!#!!'
 POST  /servers/{server_id}/action (os-resetState)
 POST  /servers/{server_id}/action (injectNetworkInfo)
 POST  /servers/{server_id}/action (resetNetwork)
 POST  /servers/{server_id}/action (changePassword)
 GET  /os-agents
 POST  /os-agents
 PUT  /os-agents/{agent_build_id}
 DELETE  /os-agents/{agent_build_id}

Similar for Keystone

$ oslopolicy-sample-generator  --namespace keystone  | egrep "POST|GET|DELETE|PUT" | sed 's!# !!' | head -10
GET  /v3/users/{user_id}/application_credentials/{application_credential_id}
GET  /v3/users/{user_id}/application_credentials
POST  /v3/users/{user_id}/application_credentials
DELETE  /v3/users/{user_id}/application_credentials/{application_credential_id}
PUT  /v3/OS-OAUTH1/authorize/{request_token_id}
GET  /v3/users/{user_id}/OS-OAUTH1/access_tokens/{access_token_id}
GET  /v3/users/{user_id}/OS-OAUTH1/access_tokens/{access_token_id}/roles/{role_id}
GET  /v3/users/{user_id}/OS-OAUTH1/access_tokens
GET  /v3/users/{user_id}/OS-OAUTH1/access_tokens/{access_token_id}/roles
DELETE  /v3/users/{user_id}/OS-OAUTH1/access_tokens/{access_token_id}

The output of the tool is a little sub-optimal, as the oslo policy enforcement used to be done using only JSON, and JSON does not allow comments, so I had to scrape the comments out of the YAML format. Ideally, we could tweak the tool to output the URL patterns and the policy rules that enforce them in a clean format.

What roles are used? Turns out, we can figure that out, too:

$ oslopolicy-sample-generator  --namespace keystone  |  grep \"role:
#"admin_required": "role:admin or is_admin:1"
#"service_role": "role:service"

So only admin or service are actually used. On Nova:

$ oslopolicy-sample-generator  --namespace nova  |  grep \"role:
#"context_is_admin": "role:admin"

Only admin.

How about matching the URL pattern to the policy rule?
If I run

oslopolicy-sample-generator  --namespace nova  |  less

In the middle I can see an example like this (# marks removed for syntax):

# Create, list, update, and delete guest agent builds

# This is XenAPI driver specific.
# It is used to force the upgrade of the XenAPI guest agent on
# instance boot.
 GET  /os-agents
 POST  /os-agents
 PUT  /os-agents/{agent_build_id}
 DELETE  /os-agents/{agent_build_id}
"os_compute_api:os-agents": "rule:admin_api"

This is not 100% deterministic, though, as some services, Nova in particular, enforce policy based on the payload.

For example, these operations can be done by the resource owner:

# Restore a soft deleted server or force delete a server before
# deferred cleanup
 POST  /servers/{server_id}/action (restore)
 POST  /servers/{server_id}/action (forceDelete)
"os_compute_api:os-deferred-delete": "rule:admin_or_owner"

Whereas these operations must be done by an admin operator:

# Evacuate a server from a failed host to a new host
 POST  /servers/{server_id}/action (evacuate)
"os_compute_api:os-evacuate": "rule:admin_api"

Both map to the same URL pattern. We tripped over this when working on RBAC in Middleware, and it is going to be an issue with the Whitelist as well.

Looking at the API docs, we can see that difference in the bodies of the operations. The Evacuate call has a body like this:

    "evacuate": {
        "host": "b419863b7d814906a68fb31703c0dbd6",
        "adminPass": "MySecretPass",
        "onSharedStorage": "False"
    }

Whereas the forceDelete call has a body like this:

    "forceDelete": null

From these, it is pretty straightforward to figure out which policy to apply, but as of yet, there is no programmatic way to access that.
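As a sketch of what that programmatic access could look like, a hand-built table keyed on the action name in the request body is enough to disambiguate (the mapping below is assembled manually from the sample-generator output; it is not an official API):

```python
# Hypothetical action-key -> policy-rule table, built by hand from the
# oslopolicy-sample-generator output shown above.
ACTION_POLICY = {
    'evacuate': 'os_compute_api:os-evacuate',
    'forceDelete': 'os_compute_api:os-deferred-delete',
    'restore': 'os_compute_api:os-deferred-delete',
}

def policy_for_action(body):
    """Given the JSON body of a POST /servers/{server_id}/action request,
    return the policy rule that would be enforced for it."""
    for key in body:
        if key in ACTION_POLICY:
            return ACTION_POLICY[key]
    return None
```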

It would take a little more scripting to try and identify the set of rules that mean a user should be able to perform those actions with a project-scoped token, versus the set of APIs that are reserved for cloud operations. However, just looking for the admin_or_owner rule is, for most APIs, sufficient to indicate that the operation should be performed using a scoped token. Thus, an end user should be able to determine the set of operations that she can include in a white-list.

by Adam Young at March 16, 2018 05:35 PM

March 12, 2018

OpenStack In Production (CERN)

Hardware burn-in in the CERN datacenter

During the Ironic sessions at the recent OpenStack PTG in Dublin in Spring 2018, there were some discussions on adding a further burn-in step to the OpenStack Bare Metal project (Ironic) state machine. The notes summarising the sessions were reported to the openstack-dev list. This blog covers the CERN burn-in process for the systems delivered to the data centers, as one example of how OpenStack Ironic users could benefit from a set of open source tools to burn in newly delivered servers as a stage within the Ironic workflow.

CERN hardware procurement follows a formal process compliant with public procurements. Following a market survey to identify potential companies in CERN's member states, a tender specification is sent to the companies asking for offers based on technical requirements.

Server burn in goals

Following the public procurement processes at CERN, large hardware deliveries occur once or twice a year and smaller deliveries multiple times per year. The overall resource management at CERN was covered in a previous blog. Part of the steps before production involves burn-in of the new servers. The goals are:
  • Ensure that the hardware delivered complies with CERN Technical Specifications
  • Find systematic issues with all machines in a delivery such as bad firmware
  • Identify failed components in single machines
  • Provoke early failure in failing components due to high load during stress testing
Depending on the hardware configuration, the burn-in tests take on average around two weeks, but this does vary significantly (e.g. for systems with large amounts of memory, the memory tests alone can take up to two weeks). This has been found to be a reasonable balance between achieving the goals above and delaying the production use of the machines with further testing which may not find more errors.

Successful execution of the CERN burn in processes is required in the tender documents prior to completion of the invoicing.


The CERN hardware follows a lifecycle from procurement to retirement as outlined below. The parts marked in red are the ones currently being implemented as part of the CERN Bare Metal deployment.

As part of the evaluation, test systems are requested from the vendor and these are used to validate compliance with the specifications. The results are also retained to ensure that the bulk equipment deliveries correspond to the initial test system configurations and performance.

Preliminary Checks

CERN requires that the Purchase Order ID and a unique System Serial Number are set in the NVRAM of the Baseboard Management Controller (BMC), in the Field Replaceable Unit (FRU) fields Product Asset Tag (PAT) and Product Serial (PS) respectively:

# ipmitool fru print 0 | tail -2
 Product Serial        : 245410-1
 Product Asset Tag     : CD5792984

The Product Asset Tag is set to the CERN delivery number and the Product Serial is set to the unique serial number for the system unit.

Likewise, certain BIOS fields have to be set correctly such as booting from network before disk to ensure the systems can be easily commissioned.

Once these basic checks have been done, the burn-in process can start. A configuration file, containing the burn-in tests to be run, is created according to the information stored in the PAT and PS FRU fields. Based on the content of the configuration file, the enabled tests start automatically.
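As an illustration, the preliminary check and the derivation of the burn-in configuration could start from parsing the ipmitool output shown above (function names are mine; the actual CERN tooling is not shown in this post):

```python
def parse_fru(output):
    """Parse 'Key : Value' lines from 'ipmitool fru print' output."""
    fields = {}
    for line in output.splitlines():
        if ':' in line:
            key, _, value = line.partition(':')
            fields[key.strip()] = value.strip()
    return fields

def fru_fields_present(fields):
    """The burn-in configuration is derived from the Product Asset Tag
    (delivery number) and Product Serial (unit serial); both must be set
    before the burn-in process can start."""
    return bool(fields.get('Product Asset Tag')) and \
        bool(fields.get('Product Serial'))
```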

Burn in

The burn-in process itself is highlighted in red in the workflow above and consists of the following steps:
  • Memory
  • CPU
  • Storage
  • Benchmarking
  • Network


The memtest stress tester is used for validation of the RAM in the system. Details of the tool are available at


Testing the CPU is performed using a set of burn tools: burnK7 or burnP6, and burnMMX. These tools not only test the CPU itself but are also useful for finding cooling issues such as broken fans, since the power load is significant while the processors run these tests.


Disk burn-ins are intended to create the conditions for early drive failure. Following the bathtub curve, the aim is to make drives prone to early failure fail before they reach production.

With this aim, we use the badblocks code to repeatedly read/write the disks. SMART counters are then checked to see if there are significant numbers of relocated bad blocks and the CERN tenders require disk replacement if the error rate is high.

We still use this process even though the primary disk storage for the operating system has now changed to SSD. There may be a case for minimising writes on an SSD to maximise the lifetime of the units.


Many of the CERN hardware procurements are based on price for the total compute capacity needed. Given the nature of most of the physics processing, the total throughput of the compute farm is more important than individual processor performance. Thus, the best total performance may be achieved by choosing processors which are slightly slower but less expensive.

CERN currently measures the CPU performance using a set of benchmarks based on a subset of the SPEC 2006 suite. The subset, called HEPSpec06, is run in parallel on each of the cores in the server to determine the total throughput from the system. Details are available at the HEPiX Benchmarking Working Group web site.

Since the offers include the expected benchmark performance, the results of the benchmarking process are used to validate the technical questionnaire submitted by the vendors. All machines in the same delivery would be expected to produce similar results so variations between different machines in the same batch are investigated.

CPU benchmarking can also be used to find problems where there is significant difference across a batch, such as incorrect BIOS settings on a particular system.

Disk performance is checked using a reference fio access suite. A minimum performance level in I/O is also required in the tender documents.


Networking interfaces are difficult to burn in compared to disks or CPUs. To do a reasonable validation, at least two machines are needed. With batches of hundreds of servers, a simple test against a single endpoint will produce unpredictable results.

Using a network broadcast, the test finds other machines running the stress test; the machines then pair up and run a number of tests.

  • iperf3 is used for bandwidth, reversed bandwidth, udp and reversed udp
  • iperf for full duplex testing (currently missing from iperf3)
  • ping is used for congestion testing
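Once two machines have paired up, the test list above corresponds to invocations roughly like the following (these use standard iperf3, iperf2 and ping options; the exact commands and parameters CERN uses are my assumption):

```python
def network_test_commands(peer):
    """Return the command lines run against a paired peer, per the list
    above: iperf3 bandwidth (forward/reverse, TCP/UDP), iperf2 full
    duplex, and ping for congestion."""
    return [
        ['iperf3', '-c', peer],              # bandwidth
        ['iperf3', '-c', peer, '-R'],        # reversed bandwidth
        ['iperf3', '-c', peer, '-u'],        # udp
        ['iperf3', '-c', peer, '-u', '-R'],  # reversed udp
        ['iperf', '-c', peer, '-d'],         # full duplex (iperf2 only)
        ['ping', '-c', '100', peer],         # congestion / packet loss
    ]
```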

Looking forward

CERN is currently deploying Ironic into production for bare metal management of machines. Integrating the burn in and retirement stages into the bare metal management states would bring easy visibility of the current state as the deliveries are processed.

The retirement stage is also of interest, to ensure that no CERN configuration (such as Ironic BMC credentials or IP addresses) remains in the servers. CERN has often donated retired servers to other high energy physics sites, such as SESAME in Jordan and Morocco, which requires a full server factory reset before dismounting. This retirement step would be a more extreme cleaning followed by complete removal from the cloud.

Discussions with other scientific laboratories, such as SKA, through the OpenStack Scientific special interest group have shown interest in extending Ironic to automate the server on-boarding and retirement processes, as described in the session at the OpenStack Sydney summit. We'll be following up on these discussions at Vancouver.


  • CERN IT department
  • CERN Ironic and Rework Contributors 
    • Alexandru Grigore
    • Daniel Abad
    • Mateusz Kowalski


by Tim Bell at March 12, 2018 11:29 AM

Lars Kellogg-Stedman

Using Docker macvlan networks

A question that crops up regularly on #docker is "How do I attach a container directly to my local network?" One possible answer to that question is the macvlan network type, which lets you create "clones" of a physical interface on your host and use that to attach containers directly …

by Lars Kellogg-Stedman at March 12, 2018 04:00 AM

March 07, 2018

Andrew Beekhof

A New Fencing Mechanism (TBD)

Protecting Database Centric Applications

In the same way that some applications require the ability to persist records to disk, for some applications the loss of access to the database means game over - more so than disconnection from the storage.

Cinder-volume is one such application, and as it moves towards an active/active model, it is important that a failure in one peer does not represent a SPoF. In the Cinder architecture, the API server has no way to know whether the cinder-volume process is fully functional - so it will still receive new requests to execute.

A cinder-volume process that has lost access to the storage will naturally be unable to complete requests. Worse, though, is losing access to the database, as this means the result of an action cannot be recorded.

For some operations this is OK, if wasteful, because the operation will fail and be retried. Deletion of something that was already deleted is usually treated as a success, and a re-attempted volume creation will return a new volume. However, performing the same resize operation twice is highly problematic, since the recorded old size no longer matches the actual size.

Even the safe operations may never complete because the bad cinder-volume process may end up being asked to perform the cleanup operations from its own failures, which would result in additional failures.

Additionally, despite not being recommended, some Cinder drivers make use of locking. For those drivers it is just as crucial that any locks held by a faulty or hung peer can be recovered within a finite period of time. Hence the need for fencing.

Since power-based fencing is so dependent on node hardware, and there is always some kind of storage involved, the idea of leveraging the SBD[1] (Storage Based Death) project's capabilities to do disk-based heartbeating and poison pills is attractive. When combined with a hardware watchdog, it is an extremely reliable way to ensure safe access to shared resources.

However, in Cinder's case not all vendors can provide raw access to a small block device on the storage. Additionally, it is really access to the database that needs protecting, not the storage. So while useful, it is still relatively easy to construct scenarios that would defeat SBD.

A New Type of Death

Where SBD uses storage APIs to protect applications persisting data to disk, we could also have one based on SQL calls that did the same for Cinder-volume and other database centric applications.

I therefore propose TBD - “Table Based Death” (or “To Be Decided”, depending on how you’re wired).

Instead of heartbeating to a designated slot on a block device, the slots become rows in a small table in the database that this new daemon would interact with via SQL.

When a peer is connected to the database, a cluster manager like Pacemaker can use a poison pill to fence the peer in the event of a network, node, or resource level failure. Should the peer ever lose quorum or its connection to the database, surviving peers can assume with a degree of confidence that it will self-terminate via the watchdog after a known interval.

The desired behaviour can be derived from the following properties:

  1. Quorum is required to write poison pills into a peer’s slot

  2. A peer that finds a poison pill in its slot triggers its watchdog and reboots

  3. A peer that loses its connection to the database won’t be able to write status information to its slot, which will trigger the watchdog

  4. A peer that loses its connection to the database won’t be able to write a poison pill into another peer’s slot

  5. If the underlying database loses too many peers and reverts to read-only, we won’t be able to write to our slot, which triggers the watchdog

  6. When a peer loses its connection to the other peers, the survivors maintain quorum (1) and write a poison pill into the lost peer’s slot, ensuring the peer will terminate due to scenario (2) or (3)

If N seconds is the worst-case time a peer would need to either notice a poison pill or a disconnection from the database, and trigger the watchdog, then we can arrange for services to be recovered after some multiple of N has elapsed, in the same way that Pacemaker does for SBD.

While TBD would be a valuable addition to a traditional cluster architecture, it is also conceivable that it could be useful in a stand-alone configuration. Consideration should therefore be given during the design phase to how best to consume membership, quorum, and fencing requests from multiple sources - not just a particular application or cluster manager.


Just as in the SBD architecture, we need TBD to be configured to use the same persistent store (database) as is being consumed by the applications it is protecting. This is crucial as it means the same criteria that enables the application to function, also results in the node self-terminating if it cannot be satisfied.

However for security reasons, the table would ideally live in a different namespace and with different access permissions.

It is also important to note that significant design challenges would need to be faced in order to protect applications managed by the same cluster that provides the highly available database consumed by TBD. Consideration would particularly need to be given to the behaviour of TBD and the applications it protects during shutdown and cold-start scenarios. Care would need to be taken to avoid unnecessary self-fencing operations and to ensure that failure responses are not impacted when handling these scenarios.


[1] SBD lives under the ClusterLabs banner but can operate without a traditional corosync/pacemaker stack.

by Andrew Beekhof at March 07, 2018 02:11 AM

March 06, 2018

Adam Young

Generating a Callgraph for Keystone

Once I know a starting point for a call, I want to track the other functions that it calls. pycallgraph will generate an image that shows me that.

All this is done inside the virtual env set up by tox at keystone/.tox/py35

I need a stub of a script file in order to run it. I’ll put this in tmp:

from keystone.identity import controllers
from keystone.server import wsgi
from keystone.common import request

def main():
    d = dict()
    r = request.Request(d)
    c = controllers.UserV3()

if __name__ == '__main__':
    main()
To install pycallgraph:

pip install pycallgraph

And to run it:

pycallgraph --max-depth 6 graphviz /tmp/

It errors out due to auth issues (it is actually running the code, so don't do this on a production server).

Here is what it generated.

Click to enlarge. Not great, but it is a start.

by Adam Young at March 06, 2018 09:53 PM

Inspecting Keystone Routes

What policy is enforced when you call a Keystone API? Right now, there is no definitive way to say. However, with some programmatic help, we might be able to figure it out from the source code. Let's start by getting a complete list of the Keystone routes.

In the WSGI framework that Keystone uses, a Route is the object used to match the URL. For example, when I try to look at the user with UserId abcd1234, I submit a GET request to the URL https://hostname:port/v3/users/abcd1234. The route path is the pattern /users/{user_id}. The WSGI framework handles the parts of the URL prior to that, and eventually needs to pull out a Python function to execute for the route. Here is how we can generate a list of the route paths in Keystone:

from keystone.server import wsgi
app = wsgi.initialize_admin_application()
composing = app['/v3'].application.application.application.application.application.application._app.application.application.application.application
for route in composing._router.mapper.matchlist:
    print(route.routepath)
I’ll put the output at the end of this post.

That long chain of .application properties is due to the way that the pipeline is built using the paste file. In keystone/etc/keystone-paste.ini we see:

# The last item in this pipeline must be service_v3 or an equivalent
# application. It cannot be a filter.
pipeline = healthcheck cors sizelimit http_proxy_to_wsgi osprofiler url_normalize request_id build_auth_context token_auth json_body ec2_extension_v3 s3_extension service_v3

Each of those pipeline elements is a Python class, specified earlier in the file, that honors the middleware contract. Most of these can be traced back to the keystone.common.wsgi.Middleware base class, which implements the contract in its __call__ method.

    def __call__(self, request):
        response = self.process_request(request)
        if response:
            return response
        response = request.get_response(self.application)
        return self.process_response(request, response)

The odd middleware out is AuthContextMiddleware, which extends from keystonemiddleware.auth_token.BaseAuthProtocol. See if you can spot the difference:

    def __call__(self, req):
        """Handle incoming request."""
        response = self.process_request(req)
        if response:
            return response
        response = req.get_response(self._app)
        return self.process_response(response)

Yep: self._app.

Here is the output from the above code, executed in the python interpreter. This does not have the Verbs in it yet, but a little more poking should show where they are stored:

>>> for route in composing._router.mapper.matchlist:
...     print(route.routepath)

by Adam Young at March 06, 2018 05:17 PM

March 04, 2018

Rich Bowen


I’m heading home from SnowpenStack and it was quite a ride. As Thierry said in our interview at the end of Friday (coming soon to a YouTube channel near you), rather than spoiling things, the freak storm and subsequent closure of the event venue served to create a shared experience and camaraderie that made it even better.

In the end I believe I got 29 interviews, and I’ll hopefully be supplementing this with a dozen online interviews in the coming weeks.

If you missed your interview, or weren’t at the PTG, please contact me and we’ll set something up. And I’ll be in touch with all of the PTLs who were not already represented in one of my interviews.

A huge thank you to everyone that made time to do an interview, and to Erin and Kendall for making everything onsite go so smoothly.

by rbowen at March 04, 2018 07:52 AM

March 02, 2018

OpenStack In Production (CERN)

Expiry of VMs in the CERN cloud

The CERN cloud resources are used for a variety of purposes from running compute intensive workloads to long running services. The cloud also provides personal projects for each user who is registered to the service. This allows a small quota (5 VMs, 10 cores) where the user can have resources dedicated for their use such as boxes for testing. A typical case would be for the CERN IT Tools training where personal projects are used as sandboxes for trying out tools such as Puppet.

Personal projects have a number of differences compared to other projects in the cloud:
  • No non-standard flavors
  • No additional quota can be requested
  • Should not be used for production services
  • VMs are deleted automatically when the person stops being a CERN user
With the number of cloud users increasing to over 3,000, there is a corresponding growth in the number of cores used by personal projects, which grew by 1,200 cores in the past year. For cases like training, it often happens that VMs are created and the user then does not remember to delete the resources, so they consume cores which could otherwise be used as compute capacity to analyse the data from the LHC.

One possible approach would be to reduce the quota further. However, tests such as setting up a Kubernetes cluster with OpenStack Magnum often need several VMs to perform the different roles so this would limit the usefulness of personal projects. The usage of the full quota is also rare.

VM Expiration

Based on a previous service which offered resources on demand (called CVI, based on Microsoft SCVMM), the approach taken was to expire personal virtual machines.
  • Users can create virtual machines up to the limit of their quota
  • Personal VMs are marked with an expiry date
  • Prior to their expiry, the user is sent several mails to inform them their VM will expire soon and how to extend it if it is still useful.
  • On expiry, the virtual machine is locked and shutdown. This helps to catch cases where people have forgotten to prolong their VMs.
  • One week later, the virtual machine is deleted, freeing up the resources.
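The decision logic behind those steps might look roughly like this (the thresholds and action names are illustrative, not the actual Mistral workflow):

```python
from datetime import datetime, timedelta

GRACE = timedelta(days=7)  # one week between shutdown and deletion

def expiry_action(expire_at, now, warn_window=timedelta(days=14)):
    """Decide what the expiration workflow should do with a personal VM,
    following the steps above; the warning window is an assumption."""
    if expire_at is None:
        return 'set-expiry'          # newly created VM: stamp a date
    if now >= expire_at + GRACE:
        return 'delete'              # one week past expiry: free resources
    if now >= expire_at:
        return 'lock-and-shutdown'   # just expired: catch forgotten VMs
    if now >= expire_at - warn_window:
        return 'notify-owner'        # expiry approaching: send mail
    return 'ignore'
```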


We use Mistral to automate several OpenStack tasks in the cloud (such as regular snapshots and project creation/deletion). This has the benefit of a clean audit log showing which steps worked or failed, along with clear input/output states supporting retries, and an authenticated cloud cron for scheduling.

Our OpenStack projects have some properties set when they are created. This is used to indicate additional information, like the accounting codes to be charged for the usage. There are properties indicating the type of project, such as personal, and whether the expiration workflow should apply. Mistral YAQL code can then select the resources to which expiration applies.

task(retrieve_all_projects).result.select(dict(id => $.id, name => $.name, enabled => $.enabled, type => $.get('type','none'), expire => $.get('expire','off'))).where($.type='personal').where($.enabled).where($.expire='on')

The expire_at parameter is stored as a VM property. This makes it visible to automation, such as CLIs, through the openstack server show command.

There are several parts to the process:
  • A cron-triggered workflow which:
    • ignores machines in error state or currently building
    • sets the expiration date, according to the grace period, on a newly created machine which does not have one
    • sees if any machines are getting close to their expiry time and sends a mail to the owner
    • checks for invalid settings of the expire_at property (such as people setting it a long way in the future or deleting the property) and restores a reasonable value if this is detected
    • locks and shuts down a machine that has reached its expiry date
    • deletes a machine that is past its expiry date by the grace period
  • A workflow, launched by Horizon or from the CLI, which:
    • retrieves the expire_at value and extends it by the prolongation period

The user notification is done using a set of mail templates and a dedicated workflow. This allows templates such as instance reminders to have details about the resources included, such as the example from the mail template.

    The Virtual Machine {instance} from the project {project_name} in the Cloud Infrastructure Service will expire on {expire_date}.

A couple of changes to Mistral will be submitted upstream:
  • Support for HTML mail bodies, which allows us to have a nicer looking e-mail for notifications, with links included
  • Support for BCC/CC on the mail, so that the OpenStack cloud administrator e-mail can also be kept on copy when there are notifications

A few minor changes to Horizon were also done (currently local patches):
  • Display the expire_at value on the instance details page
  • Add a 'prolong' action so that instances can be prolonged via the web, by using the properties editor to set the date of the expiry (defaulting to the current date with the expiry time). This launches the workflow for prolonging the instance.


Jose Castro Leon from the CERN cloud team


by Jose Castro Leon at March 02, 2018 11:46 AM

March 01, 2018

Carlos Camacho

My 2nd birthday as a Red Hatter

    This post is about my experience working on TripleO as a Red Hatter for the last 2 years. In these two years I have learned about many technologies, really a lot… But the most intriguing thing is that here you never stop learning. Not because you don’t want to learn new things, but because of the project’s nature, this project… TripleO…

    TripleO (OpenStack On OpenStack) is software aimed at deploying OpenStack services using the OpenStack ecosystem itself: we deploy a minimal OpenStack instance (the Undercloud) and from there deploy our production environment (the Overcloud)… Yikes! What a mouthful, huh? Put simply, TripleO is an installer which should make integrators’, operators’ and developers’ lives easier, but the reality is sometimes far from the expectation.

    TripleO is capable of doing wonderful things; with a little patience, love, and dedication, your hands can be the right hands to deploy complex environments with ease.

    One of the cool things about being one of the programmers who write TripleO, from now on TripleOers, is that many of us also use the software regularly. We are writing code not just because we are told to, but because we want to improve it for our own purposes.

    Part of the programmers’ motivation has to do with TripleO’s open-source nature: if you code in TripleO you are part of a community.

    Congratulations! As a TripleO user or a TripleOer, you are a part of our community and it means that you’re joining a diverse group that spans all age ranges, ethnicities, professional backgrounds, and parts of the globe. We are a passionate bunch of crazy people, proud of this “little” monster and more than willing to help others enjoy using it as much as we do.

    Getting to know the interface (the templates, Mistral, Heat, Ansible, Docker, Puppet, Jinja, …) and how all the components are tied together is probably one of the most daunting aspects of TripleO for newcomers (and not only newcomers). This will surely raise the blood pressure of some of you who tried using TripleO in the past but failed miserably and gave up in frustration when it did not behave as expected. Yeah… sometimes that “$h1t” happens…

    Although learning TripleO isn’t that easy, the architecture updates, the decoupling of the role services (“composable roles”), the backup and restore strategies, and the integration of Ansible, among many others, have made great strides toward alleviating that frustration, and the improvements continue through to today.

    So this is the question…

    Is TripleO meant to be “fast to use” or “fast to learn”?

    “Fast to use” versus “fast to learn” is a significant way of describing software products, but we need to know what our software will be used for… TripleO is designed to work at scale. It might be easier to deploy a few controllers and computes manually, but what about deploying 100 computes, 3 controllers and 50 Cinder nodes, all of them configured to be integrated and work as one single “cloud”? Buum! There we find TripleO’s benefit: if we want to make it scale, we need to make it fast to use…

    This means that we will find several customizations, hacks and workarounds to make it work as we need it to.

    The upside to this approach is that TripleO evolved to be super-ultra-giga customizable, so operators are able to produce great environments blazingly fast.

    The downside? Jaja, yes… there is a downside, “or several”. As with most things that are heavily customized, TripleO became somewhat difficult for new people to understand. Also, it’s incredibly hard to test all the possible deployments, and when a user makes non-standard or unsupported customizations, the upgrades are not as intuitive as they need to be…

    This trade‐off is what I mean when I say “fast to use versus fast to learn.” You can be extremely productive with TripleO after you understand how it thinks “yes, it thinks”.

    However, your first few deployments and patches might be arduous. Of course, alleviating that potential pain is what our work is about. IMHO the pros are more than the cons and once you find a niche to improve it will be a really nice experience.

    Also, we have the TripleO YouTube channel, a place to push video tutorials and deep dive sessions, driven by the community for the community.

    For the Spanish community we have a 100% translated TripleO UI; help us to reach as many languages as possible!!! was born on July 5th of 2016 (first GitHub commit); it is my way of expressing my gratitude to the community: some Ctrl+C / Ctrl+V recipes to avoid the frustration of working with TripleO without having something deployed and easy to use ASAP.

    Anstack does not have much traffic, but the TripleO cheatsheets made it around, including to FOSDEM, so in general it is really nice when people reference your writings. Maybe in the future it can evolve to be more related to ANsible and openSTACK ;) as TripleO is adding more and more support for Ansible.

    What about Red Hat? Yeahp, I have been speaking about the project for a while but haven’t spoken about the company making it all real. Red Hat is the world’s leading provider of open source solutions, using a community-powered approach to provide reliable and high-performing cloud, virtualization, storage, Linux, and middleware technologies.

    There is a strong feeling of belonging at Red Hat: you are part of a team and a culture, and you are able to find a perfect balance between your work and life. Also, having people from all over the globe makes it a perfect place for sharing ideas and collaborating. Not all of it is easy, though: working mostly remotely in upstream communities can be really hard to manage if you are not 100% sure about the tasks that need to be done.

    Keep rocking and become part of the TripleO community!

    by Carlos Camacho at March 01, 2018 12:00 AM

    February 28, 2018

    Adam Young

    OpenStack Role Assignment Inheritance for CloudForms

    Operators expect to use CloudForms to perform administrative tasks. For this reason, the documentation for OpenStack states that the Keystone user must have an ‘admin’ role. We found at least one case, however, where this was not sufficient. Fortunately, we have a better approach, and one that can lead to success in a wider array of deployments.


    CloudForms uses the role assignments for the given user account to enumerate the set of projects. Internally it creates a representation of these projects to be used to track resources. However, the way that ‘admin’ is defined in OpenStack ties it to a single project. This means that CloudForms really has no way to ask “what projects can this user manage?” Since admin anywhere is admin everywhere, you would not think that you need to enumerate projects, but it turns out that some of the more complex operations, such as mounting a volume, have to cross service boundaries and need the project abstraction to link the sets of operations. CloudForms’ design did not anticipate this disconnect, and so some of those operations fail.

    Let’s assume, for the moment, that a user had to have a role on a project in order to perform operations on that project. The current admin-everywhere approach would break. What CloudForms would require is an automated way to give a user a role on a project as soon as that project is created. It turns out that CloudForms is not the only thing that has this requirement.

    Role Assignment Inheritance

    Keystone projects do not have to be organized as a flat collection. They can be nested into a tree form. This is called “Hierarchical Multitenancy.” Added to that, a role can be assigned to a user or group on parent project and that role assignment is inherited down the tree. This is called “Role Assignment Inheritance.”

    This presentation, while old, does a great job of putting the details together.

    You don’t need to do anything different in your project setup to take advantage of this mechanism. Here’s something that is subtle: a Domain IS A project. Every project is already in a domain, and thus has a parent project. Thus, you can assign a user a role on the domain-as-a-project, and they will have that role on every project inside that domain.

    Sample Code

    Here it is in command-line form.

    openstack role add --user CloudAdmin --user-domain Default --project Default --project-domain Default --inherited admin

    Let’s take those arguments step by step.

    --user CloudAdmin  --user-domain Default

    This is the user that CloudForms is using to connect to Keystone and OpenStack. Every user is owned by a domain, and this user is owned by the “Default” domain.

    --project Default --project-domain Default

    This is the blackest of magic. The Default domain IS-A project, so it owns itself.


    --inherited

    A role assignment is either on a project OR on all of its subprojects. So, the user does not actually have a role that is usable against the Default DOMAIN-AS-A-PROJECT, but only on all of the subordinate projects. This might seem strange, but it was built this way for exactly this reason: being able to distinguish between levels of a hierarchy.

    admin

    This is the role name.


    With this role assignment, the CloudForms Management Engine instance can perform all operations on all projects within the default domain. If you add another domain to manage a separate set of projects, you would need to perform this same role assignment on the new domain as well.
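    To make the net effect concrete, here is a toy Python model of how an inherited assignment resolves down the project tree; the project names and data structures are invented for illustration, not Keystone's schema:

```python
# Toy model of role assignment inheritance. A domain IS-A project, so the
# "Default" domain sits at the root of the tree.
parents = {              # child project -> parent project
    'dev': 'Default',
    'prod': 'Default',
    'prod-db': 'prod',
}

# (user, project, role, inherited) assignments
assignments = [('CloudAdmin', 'Default', 'admin', True)]

def effective_roles(user, project):
    """Roles `user` holds on `project`: direct (non-inherited) assignments
    on the project itself, plus inherited assignments on any ancestor.
    An inherited assignment applies to sub-projects only, not to the
    project it was made on."""
    roles = {r for (u, p, r, inh) in assignments
             if u == user and p == project and not inh}
    node = parents.get(project)
    while node is not None:
        roles |= {r for (u, p, r, inh) in assignments
                  if u == user and p == node and inh}
        node = parents.get(node)
    return roles
```

    Note that `effective_roles('CloudAdmin', 'Default')` comes back empty, matching the point above that the inherited assignment is not usable on the domain-as-a-project itself.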

    I assume this is going to leave people with a lot of questions. Please leave comments, and I will try to update this with any major concepts that people want made lucid.

    by Adam Young at February 28, 2018 07:04 PM

    February 27, 2018

    Lars Kellogg-Stedman

    Listening for connections on all ports/any port

    On IRC -- and other online communities -- it is common to use a "pastebin" service to share snippets of code, logs, and other material, rather than pasting them directly into a conversation. These services will typically return a URL that you can share with others so that they can see the …

    by Lars Kellogg-Stedman at February 27, 2018 05:00 AM

    February 26, 2018

    Lars Kellogg-Stedman

    Grouping aggregation queries in Gnocchi 4.0.x

    In this article, we're going to ask Gnocchi (the OpenStack telemetry storage service) how much memory was used, on average, over the course of each day by each project in an OpenStack environment.


    I'm working with an OpenStack "Pike" deployment, which means I have Gnocchi 4.0.x. More …

    by Lars Kellogg-Stedman at February 26, 2018 05:00 AM

    February 23, 2018

    Carlos Camacho

    TripleO deep dive session #12 (config-download)

    This is the 12th release of the TripleO “Deep Dive” sessions

    Thanks to James Slagle for this new session, in which he describes a feature called config-download.

    In this session we will have an update on the TripleO Ansible integration called config-download. It’s about applying all the software configuration with Ansible instead of doing it with the Heat agents.

    So please, check the full session content on the TripleO YouTube channel.

    Please check the sessions index to have access to all available content.

    by Carlos Camacho at February 23, 2018 12:00 AM

    February 21, 2018

    OpenStack In Production (CERN)

    Maximizing resource utilization with Preemptible Instances


    The CERN cloud consists of around 8,500 hypervisors providing over 36,000 virtual machines. These provide the compute resources both for the laboratory's physics program and for the organisation's administrative operations, such as paying bills and reserving rooms at the hostel.

    The resources themselves are generally ordered once or twice a year, with servers being kept for around 5 years. Within the CERN budget, the resource planning team looks at:
    • The resources required to run the computing services for the CERN laboratory. These are projected using capacity planning trend data and upcoming projects such as video conferencing.
    With the installation and commissioning of thousands of servers concurrently (along with their associated decommissioning 5 years later), there are scenarios to exploit underutilised servers. Programs such as LHC@Home are used, but we have also been interested in expanding the cloud to provide virtual machine instances which can be rapidly terminated in the event of:
    • Resources being required for IT services as they scale out for events such as a large-scale web cast on a popular topic, or to provision instances for a new version of an application.
    • Partially full hypervisors where the last remaining cores are not being requested (the Tetris problem).
    • Compute servers at the end of their lifetime which are used to the full before being removed from the computer centre to make room for new deliveries which are more efficient and in warranty.
    The defining characteristic of this workload is that it should be possible to stop an instance within a short time (a few minutes), compared to a traditional physics job.

    Resource Management in OpenStack

    Operators use project quotas to ensure fair sharing of their infrastructure. The problem is that quotas are hard limits. This leads to dedicating resources to workloads even if they are not used all the time, or to situations where resources are not available even though there is quota still to use.

    At the same time, the demand for cloud resources is increasing rapidly. Since there is no cloud with infinite capacity, operators need a way to optimize resource utilization before proceeding to expand their infrastructure.

    Resources can sit idle, leaving cloud utilization below the full potential of the acquired equipment, even while the users’ requirements are growing.

    The concept of preemptible instances can be the solution to this problem. This type of server can be spawned on top of the project's quota, making use of the underutilised capacity. When the resources are requested by tasks with higher priority (such as approved quota), the preemptible instances are terminated to make space for the new VM.

    Preemptible Instances with OpenStack

    Supporting preemptible instances would mirror the AWS Spot Market and Google's Preemptible Instances. There are multiple things to be addressed as part of an implementation with OpenStack, but the most important can be reduced to these:
    1. Tagging servers as preemptible
    In order to distinguish between preemptible and non-preemptible servers, the instances need to be tagged at creation time. This property should be immutable for the lifetime of the servers.
    2. Who gets to use preemptible instances
    There is also the need to limit which user/project is allowed to use preemptible instances. An operator should be able to choose which users are allowed to spawn this type of VM.
    3. Selecting servers to be terminated
    Considering that preemptible instances can be scattered across different cells/availability zones/aggregates, there has to be “someone” able to find the existing instances, decide how to free up the requested resources according to the operator’s needs and, finally, terminate the appropriate VMs.
    4. Quota on top of the project’s quota
    In order to avoid possible misuse, there needs to be a way to control the amount of preemptible resources that each user/project can use. This means that apart from the quota for the standard resource classes, there should be a way to enforce quotas on the preemptible resources too.
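    The selection problem in point 3 can be illustrated with a greedy toy policy; the function and its inputs are hypothetical, and a real reaper must also account for cells, aggregates and operator policy:

```python
# Toy victim-selection policy: free at least `requested_vcpus` by
# terminating preemptible instances, smallest first, to minimise
# collateral damage. Purely illustrative, not the prototype's code.
def select_victims(preemptible, requested_vcpus):
    """preemptible: list of (instance_id, vcpus); returns ids to terminate,
    or None if the request cannot be satisfied."""
    victims, freed = [], 0
    for instance_id, vcpus in sorted(preemptible, key=lambda x: x[1]):
        if freed >= requested_vcpus:
            break
        victims.append(instance_id)
        freed += vcpus
    if freed < requested_vcpus:
        return None
    return victims
```

    Other policies are just as plausible (largest first, oldest first, per-project fairness); the point is that the reaper needs a global view before it can decide.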

    OPIE : IFCA and Indigo Dataclouds

    In 2014, there were the first investigations into approaches by Alvaro Lopez from IFCA. As part of the EU Indigo Datacloud project, this led to the development of the OpenStack Pre-Emptible Instances package (OPIE). This was written up in a paper in the Journal of Physics: Conference Series and presented at the OpenStack summit.

    Prototype Reaper Service

    At the OpenStack Forum during a recent OpenStack summit, a detailed discussion took place on how spot instances could be implemented without significant changes to Nova. The ideas were then followed up with the OpenStack Scientific Special Interest Group.

    Trying to address the different aspects of the problem, we are currently prototyping a “Reaper” service. This service acts as an orchestrator for preemptible instances. Its sole purpose is to decide how to free up the preemptible resources when they are requested for another task.

    The reason for implementing this prototype is mainly to help us identify possible changes needed in the Nova codebase to support preemptible instances.

    More on this WIP can be found here: 


    The concept of preemptible instances gives operators the ability to provide a more "elastic" capacity. At the same time, it enables the handling of increased demand for resources, with the same infrastructure, by maximizing cloud utilization.

    This type of server is perfect for tasks/apps that can be terminated at any time, enabling users to take advantage of extra CPU power on demand, without the fixed limits that quotas enforce.

    Finally, here at CERN, there is an ongoing effort to provide a prototype orchestrator for preemptible servers with OpenStack, in order to pinpoint the changes needed in Nova to support this feature optimally. This could also be made available in future for other OpenStack clouds in use by CERN, such as the T-Systems Open Telekom Cloud through the Helix Nebula Open Science Cloud.


    • Theodoros Tsioutsias (CERN openlab fellow working on Huawei collaboration)
    • Spyridon Trigazis (CERN)
    • Belmiro Moreira (CERN)


    by Theodoros Tsioutsias at February 21, 2018 12:37 PM

    February 16, 2018

    Matthias Runge

    Testing TripleO on own OpenStack deployment

    For some use cases, it's quite useful to test TripleO deployments on an OpenStack powered cloud, rather than on a baremetal system. The following article will show you how to do it:

    We're going to use tripleo-quickstart. This also assumes you have your OpenStack credentials downloaded and stored …

    by mrunge at February 16, 2018 08:30 AM

    Andrew Beekhof

    A New Thing

    I made a new thing.

    If you’re interested in Kubernetes and/or managing replicated applications, such as Galera, then you might also be interested in an operator that allows this class of applications to be managed natively by Kubernetes.

    There is plenty to read on why the operator exists, how replication is managed and the steps to install it if you’re interested in trying it out.

    There is also a screencast that demonstrates the major concepts:


    Feedback welcome.

    by Andrew Beekhof at February 16, 2018 03:32 AM

    February 15, 2018

    Andrew Beekhof

    Two Nodes - The Devil is in the Details

    tl;dr - Many people love 2-node clusters because they seem conceptually simpler and 33% cheaper, but while it’s possible to construct good ones, most will have subtle failure modes

    The first step towards creating any HA system is to look for and try to eliminate single points of failure, often abbreviated as SPoF.

    It is impossible to eliminate all risk of downtime, and, especially when one considers the additional complexity that comes with introducing additional redundancy, concentrating on single (rather than chains of related, and therefore decreasingly probable) points of failure is widely accepted as a suitable compromise.

    The natural starting point then is to have more than one node. However before the system can move services to the surviving node after a failure, in general, it needs to be sure that they are not still active elsewhere.

    So not only are we looking for SPoFs, but we are also looking to balance risks and consequences and the calculus will be different for every deployment [1]

    There is no downside if a failure causes both members of a two-node cluster to serve up the same static website. However, it's a very different story if it results in both sides independently managing a shared job queue, or providing uncoordinated write access to a replicated database or shared filesystem.

    So in order to prevent a single node failure from corrupting your data or blocking recovery, we rely on something called fencing.


    At its heart, fencing turns the question “Can our peer cause data corruption?” into the answer “no”, by isolating the failed node both from incoming requests and from persistent storage. The most common approach to fencing is to power off failed nodes.

    There are two categories of fencing, which I will call direct and indirect, but which could equally be called active and passive. Direct methods involve action on the part of surviving peers, such as interacting with an IPMI or iLO device, whereas indirect methods rely on the failed node to somehow recognise it is in an unhealthy state (or is at least preventing the remaining members from recovering) and to signal a hardware watchdog to panic the machine.

    Quorum helps in both these scenarios.

    Direct Fencing

    In the case of direct fencing, we can use it to prevent fencing races when the network fails. By including the concept of quorum, there is enough information in the system (even without connectivity to their peers) for nodes to automatically know whether they should initiate fencing and/or recovery.

    Without quorum, both sides of a network split will rightly assume the other is dead and rush to fence each other. In the worst case, both sides succeed, leaving the entire cluster offline. The next worst is a death match: a never-ending cycle of nodes coming up, not seeing their peers, rebooting them, and initiating recovery only to be rebooted when their peer goes through the same logic.

    The problem with fencing is that the most commonly used devices become inaccessible due to the same failure events we want to use them to recover from. Most IPMI and iLO cards lose power along with the hosts they control, and by default they use the same network that is causing the peers to believe the others are offline.

    Sadly, the intricacies of IPMI and iLO devices are rarely a consideration at the point hardware is being purchased.

    Indirect Fencing

    Quorum is also crucial for driving indirect fencing and, when done right, can allow survivors to safely assume that missing nodes have entered a safe state after a defined period of time.

    In such a setup, the watchdog’s timer is reset every N seconds unless quorum is lost. If the timer (usually some multiple of N) expires, then the machine performs an ungraceful power off (not shutdown).
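    The scheme can be simulated in a few lines; the timeout value and the tick-based model below are illustrative stand-ins for a real hardware watchdog:

```python
# Toy simulation of the watchdog scheme above: the timer is petted while
# quorum holds; once quorum stays lost for the timeout, the node powers off.
WATCHDOG_TIMEOUT = 3  # in ticks; stand-in for "some multiple of N seconds"

def watchdog_survives(quorum_by_tick):
    """quorum_by_tick: one boolean per tick. Returns True if the node is
    still up at the end, False if the watchdog fired (ungraceful power off)."""
    ticks_since_pet = 0
    for has_quorum in quorum_by_tick:
        if has_quorum:
            ticks_since_pet = 0          # quorum held: pet the watchdog
        else:
            ticks_since_pet += 1
            if ticks_since_pet >= WATCHDOG_TIMEOUT:
                return False             # timer expired: machine panics
    return True
```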

    This is very effective but without quorum to drive it, there is insufficient information from within the cluster to determine the difference between a network outage and the failure of your peer. The reason this matters is that without a way to differentiate between the two cases, you are forced to choose a single behaviour mode for both.

    The problem with choosing a single response is that there is no course of action that both maximises availability and prevents corruption.

    • If you choose to assume the peer is alive but it actually failed, then the cluster has unnecessarily stopped services.

    • If you choose to assume the peer is dead but it was just a network outage, then the best case scenario is that you have signed up for some manual reconciliation of the resulting datasets.

    No matter what heuristics you use, it is trivial to construct a single failure that either leaves both sides running or where the cluster unnecessarily shuts down the surviving peer(s). Taking quorum away really does deprive the cluster of one of the most powerful tools in its arsenal.

    Given no other alternative, the best approach is normally to sacrifice availability. Making corrupted data highly available does no-one any good, and manually reconciling divergent datasets is no fun either.


    Quorum sounds great right?

    The only drawback is that in order to have it in a cluster with N members, you need to be able to see N/2 + 1 of the cluster's members, which is impossible in a two-node cluster after one node has failed.
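    In code, the quorum rule is one line, and it makes the two-node problem explicit:

```python
# Majority rule: a partition has quorum only if it can see more than half
# of the cluster's members (itself included).
def has_quorum(visible_members, cluster_size):
    return visible_members > cluster_size // 2
```

    `has_quorum(1, 2)` is False, which is exactly why the lone survivor of a two-node cluster can never safely initiate recovery on its own.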

    Which finally brings us to the fundamental issue with two-nodes:

    quorum does not make sense in two node clusters, and

    without it there is no way to reliably determine a course of action that both maximises availability and prevents corruption

    Even in a system of two nodes connected by a crossover cable, there is no way to conclusively differentiate between a network outage and a failure of the other node. Unplugging one end (the likelihood of which is surely proportional to the distance between the nodes) would be enough to invalidate any assumption that link health equals peer node health.

    Making Two Nodes Work

    Sometimes the client can't or won't make the additional purchase of a third node, and we need to look for alternatives.

    Option 1 - Add a Backup Fencing Method

    A node’s iLO or IPMI device represents a SPoF because, by definition, if it fails the survivors cannot use it to put the node into a safe state. In a cluster of 3 or more nodes, we can mitigate this with a quorum calculation and a hardware watchdog (an indirect fencing mechanism, as previously discussed). In a two-node cluster we must instead use network power switches (aka power distribution units, or PDUs).

    After a failure, the survivor first attempts to contact the primary (the built-in iLO or IPMI) fencing device. If that succeeds, recovery proceeds as normal. Only if the iLO/IPMI device fails is the PDU invoked and assuming it succeeds, recovery can again continue.

    Be sure to place the PDU on a different network to the cluster traffic, otherwise a single network failure will prevent access to both fencing devices and block service recovery.

    You might be wondering at this point… doesn’t the PDU represent a single point of failure? To which the answer is “definitely“.

    If that risk concerns you, and you would not be alone, connect both peers to two PDUs and tell your cluster software to use both when powering peers on and off. Now the cluster remains active if one PDU dies, and would require a second fencing failure of either the other PDU or an IPMI device in order to block recovery.
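    The escalation order described in this option can be sketched as follows; `fence_node` and its data shapes are invented for illustration, not an actual cluster manager API:

```python
# Illustrative sketch of Option 1's escalation order: try the built-in
# iLO/IPMI device first, then fall back to the PDU(s).
def fence_node(methods):
    """methods: ordered fencing methods to try; each is a (name, devices)
    pair where every device callable must succeed. In the dual-PDU setup
    both PDUs must cut power, since the node stays up until every feed
    is off."""
    for name, devices in methods:
        if all(device() for device in devices):
            return name          # node is now in a safe state
    return None                  # all methods failed: recovery must block
```

    With `[('ipmi', [ipmi_off]), ('pdu', [pdu_a_off, pdu_b_off])]`, the PDUs are only consulted if the iLO/IPMI device fails, mirroring the flow described above.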

    Option 2 - Add an Arbitrator

    In some scenarios, although a backup fencing method would be technically possible, it is politically challenging. Many companies like to have a degree of separation between the admin and application folks, and security conscious network admins are not always enthusiastic about handing over the usernames and passwords to the PDUs.

    In this case, the recommended alternative is to create a neutral third-party that can supplement the quorum calculation.

    In the event of a failure, a node needs to be able to see either its peer or the arbitrator in order to recover services. The arbitrator also acts as a tie-breaker if both nodes can see the arbitrator but not each other.

    This option needs to be paired with an indirect fencing method, such as a watchdog that is configured to panic the machine if it loses connection to both its peer and the arbitrator. In this way, the survivor is able to assume with reasonable confidence that its peer will be in a safe state after the watchdog expiry interval.

    The practical difference between an arbitrator and a third node is that the arbitrator has a much lower footprint and can act as a tie-breaker for more than one cluster.
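    Option 2's decision table is small enough to write out; the function name and return values are made up for illustration:

```python
# What each node should do during a failure, per the arbitrator scheme above.
def node_action(sees_peer, sees_arbitrator):
    if sees_peer:
        return 'normal-operation'        # no failure to handle
    if sees_arbitrator:
        return 'recover-peer-services'   # arbitrator breaks the tie
    return 'self-fence'                  # isolated: watchdog panics the machine
```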

    Option 3 - More Human Than Human

    The final approach is for survivors to continue hosting whatever services they were already running, but not start any new ones until either the problem resolves itself (network heals, node reboots) or a human takes on the responsibility of manually confirming that the other side is dead.

    Bonus Option

    Did I already mention you could add a third node? We test those a lot :-)

    Two Racks

    For the sake of argument, let's imagine I've convinced you, the reader, of the merits of a third node; we must now consider the physical arrangement of the nodes. If they are placed in (and obtain power from) the same rack, that too represents a SPoF, and one that cannot be resolved by adding a second rack.

    If this is surprising, consider what happens when the rack with two nodes fails and how the surviving node would differentiate between this case and a network failure.

    The short answer is that it can’t and we’re back to having all the problems of the two-node case. Either the survivor:

    • ignores quorum and incorrectly tries to initiate recovery during network outages (whether fencing is able to complete is a different story and depends on whether PDU is involved and if they share power with any of the racks), or

    • respects quorum and unnecessarily shuts itself down when its peer fails

    Either way, two racks is no better than one, and the nodes must either be given independent supplies of power or be distributed across three (or more, depending on how many nodes you have) racks.

    Two Datacenters

    By this point the more risk averse readers might be thinking about disaster recovery. What happens when an asteroid hits the one datacenter with our three nodes distributed across three different racks? Obviously Bad Things(tm) but depending on your needs, adding a second datacenter might not be enough.

    Done properly, a second datacenter gives you a (reasonably) up-to-date and consistent copy of your services and their data. However just like the two- node and two-rack scenarios, there is not enough information in the system to both maximise availability and prevent corruption (or diverging datasets). Even with three nodes (or racks), distributing them across only two datacenters leaves the system unable to reliably make the correct decision in the (now far more likely) event that the two sides cannot communicate.

    Which is not to say that a two-datacenter solution is never appropriate. It is not uncommon for companies to want a human in the loop before taking the extraordinary step of failing over to a backup datacenter. Just be aware that if you want automated failover, you're either going to need a third datacenter in order for quorum to make sense (either directly or via an arbitrator), or find a way to reliably power fence an entire datacenter.


    [1] Not everyone needs redundant power companies with independent transmission lines. Although the paranoia paid off for at least one customer when their monitoring detected a failing transformer. The customer was on the phone trying to warn the power company when it finally blew.

    by Andrew Beekhof at February 15, 2018 11:52 PM

    February 10, 2018

    Adam Young

    Deleting an image on RDO

    So I uploaded a qcow image…but did it wrong. It was tagged as raw instead of qcow, and now I want it gone. Only problem….it is stuck.

    $ openstack image delete rhel-server-7.4-update-4-x86_64
    Failed to delete image with name or ID 'rhel-server-7.4-update-4-x86_64': 409 Conflict
    Image 2e77971e-7746-4992-8e1e-7ce1be8528f8 could not be deleted because it is in use: The image cannot be deleted because it is in use through the backend store outside of Glance.

    But….I deleted all of the instances connected to it! Come On!

    Answer is easy once the code-rage wears off…

    When I created a server based on this image, it created a new volume. That volume is locking the image into place.

    $ openstack volume list
    +--------------------------------------+------+-----------+------+----------------------------------+
    | ID                                   | Name | Status    | Size | Attached to                      |
    +--------------------------------------+------+-----------+------+----------------------------------+
    | 97a15e9c-2744-4f31-95f3-a13603e49b6d |      | error     |    1 |                                  |
    | c9337612-8317-425f-b313-f8ba9336f1cc |      | available |    1 |                                  |
    | 9560a18f-bfeb-4964-9785-6e76fa720892 |      | in-use    |    9 | Attached to showoff on /dev/vda  |
    | 0188edd7-7e91-4a80-a764-50d47bba9978 |      | in-use    |    9 | Attached to test1 on /dev/vda    |
    +--------------------------------------+------+-----------+------+----------------------------------+

    See that error? I think it’s that one. I can’t confirm now, as I also deleted the available one, as I didn’t need it, either.

    $ openstack volume delete 97a15e9c-2744-4f31-95f3-a13603e49b6d
    $ openstack volume delete c9337612-8317-425f-b313-f8ba9336f1cc
    $ openstack image delete rhel-server-7.4-update-4-x86_64

    And that last command succeeded.

    $ openstack image show  rhel-server-7.4-update-4-x86_64
    Could not find resource rhel-server-7.4-update-4-x86_64

    by Adam Young at February 10, 2018 12:06 AM

    February 09, 2018

    Adam Young

    Keystonerc for RDO cloud

    If you are using RDO Cloud and want to do command line Ops, here is the outline of a keystone.rc file you can use to get started.

    unset $( set | awk 'BEGIN {FS="="} /^OS_/ {print $1}' )
    export OS_AUTH_URL=
    export OS_IDENTITY_API_VERSION=3
    export OS_USERNAME={username}
    export OS_PASSWORD={password}
    export OS_USER_DOMAIN_NAME=Default
    export OS_PROJECT_DOMAIN_NAME=Default
    export OS_PROJECT_NAME={projectname}

    You might have been given a different AUTH URL to use. The important parts are appending the /v3/ to it and explicitly setting OS_IDENTITY_API_VERSION=3. Setting both is overkill, but you can never have too much overkill.
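    The unset line at the top is worth a closer look: it clears any OS_* variables left over from a previously sourced rc file, so stale credentials can't leak into the new session. A minimal demonstration (bash; the example variable values are made up):

```shell
# Demonstrates the cleanup line from the rc file above: any previously
# exported OS_* variable is removed before the new ones are set.
export OS_USERNAME=olduser
export OS_REGION_NAME=oldregion
unset $( set | awk 'BEGIN {FS="="} /^OS_/ {print $1}' )
set | grep '^OS_' || echo "no OS_ variables remain"
# -> no OS_ variables remain
```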

    Once you have this set, source it, and you can run:

    $ openstack image list
    | ID                                   | Name                                      | Status |
    | af47a290-3af3-4e46-bb56-4f250a3c20a4 | CentOS-6-x86_64-GenericCloud-1706         | active |
    | b5446129-8c75-4ce7-84a3-83756e5f1236 | CentOS-7-x86_64-GenericCloud-1701         | active |
    | 8f41e8ce-cacc-4354-a481-9b9dba4f6de7 | CentOS-7-x86_64-GenericCloud-1703         | active |
    | 42a43956-a445-47e5-89d0-593b9c7b07d0 | CentOS-7-x86_64-GenericCloud-1706         | active |
    | ffff3320-1bf8-4a9a-a26d-5abd639a6e33 | CentOS-7-x86_64-GenericCloud-1708         | active |
    | 28b76dd3-4017-4b46-8dc9-98ef1cb4034f | CentOS-7-x86_64-GenericCloud-1801-01      | active |
    | 2e596086-38c9-41d1-b1bd-bcf6c3ddbdef | CentOS-Atomic-Host-7.1706-GenericCloud    | active |
    | 1dfd12d7-6f3a-46a6-ac69-03cf870cd7be | CentOS-Atomic-Host-7.1708-GenericCloud    | active |
    | 31e9cf36-ba64-4b27-b5fc-941a94703767 | CentOS-Atomic-Host-7.1801-02-GenericCloud | active |
    | c59224e2-c5df-4a86-b7b6-49556d8c7f5c | bmc-base                                  | active |
    | 5dede8d3-a723-4744-97df-0e6ca93f5460 | ipxe-boot                                 | active |

    by Adam Young at February 09, 2018 10:17 PM

    Steve Hardy

    Debugging TripleO revisited - Heat, Ansible & Puppet

    Some time ago I wrote a post about debugging TripleO heat templates, which contained some details of possible debug workflows when TripleO deployments fail.

    In recent releases (since the Pike release) we've made some major changes to the TripleO architecture - we make more use of Ansible "under the hood", and we now support deploying containerized environments.  I described some of these architectural changes in a talk at the recent OpenStack Summit in Sydney.

    In this post I'd like to provide a refreshed tutorial on typical debug workflow, primarily focussing on the configuration phase of a typical TripleO deployment, and with particular focus on interfaces which have changed or are new since my original debugging post.

    We'll start by looking at the deploy workflow as a whole, and some heat interfaces for diagnosing the nature of the failure; then we'll look at how to debug directly via Ansible and Puppet.  In a future post I'll also cover the basics of debugging containerized deployments.

    The TripleO deploy workflow, overview

    A typical TripleO deployment consists of several discrete phases, which are run in order:

    Provisioning of the nodes

    1. A "plan" is created (heat templates and other files are uploaded to Swift running on the undercloud)
    2. Some validation checks are performed by Mistral/Heat then a Heat stack create is started (by Mistral on the undercloud)
    3. Heat creates some groups of nodes (one group per TripleO role e.g "Controller"), which results in API calls to Nova
    4. Nova makes scheduling/placement decisions based on your flavors (which can be different per role), and calls Ironic to provision the baremetal nodes
    5. The nodes are provisioned by Ironic

    This first phase is the provisioning workflow; it is complete when the nodes are reported ACTIVE by nova (e.g the nodes are provisioned with an OS and running).

    Host preparation

    The next step is to configure the nodes in preparation for starting the services, which again has a specific workflow (some optional steps are omitted for clarity):

    1. The node networking is configured, via the os-net-config tool
    2. We write hieradata for puppet to the node filesystem (under /etc/puppet/hieradata/*)
    3. We write some data files to the node filesystem (a puppet manifest for baremetal configuration, and some json files that are used for container configuration)

    Service deployment, step-by-step configuration

    The final step is to deploy the services, either on the baremetal host or in containers; this consists of several tasks run in a specific order:

    1. We run puppet on the baremetal host (even in the containerized architecture this is still needed, e.g to configure the docker daemon and a few other things)
    2. We run "" to generate the configuration files for each enabled service (this only happens once, on step 1, for all services)
    3. We start any containers enabled for this step via the "paunch" tool, which translates some json files into running docker containers, and optionally does some bootstrapping tasks.
    4. We run again (with a different configuration, only on one node the "bootstrap host"), this does some bootstrap tasks that are performed via puppet, such as creating keystone users and endpoints after starting the service.

    Note that these steps are performed repeatedly with an incrementing step value (e.g step 1, 2, 3, 4, and 5), with the exception of the "" config generation which we only need to do once (we just generate the configs for all services regardless of which step they get started in).
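    The per-step loop described above can be sketched schematically as follows (echo stubs only, standing in for the real ansible-driven tasks; this is not TripleO's actual code):

```shell
# Schematic of the step-by-step service deployment described above.
echo "once: generate configuration files for all enabled services"
for step in 1 2 3 4 5; do
  echo "step $step: run puppet on the baremetal host"
  echo "step $step: start this step's containers via paunch"
  echo "step $step: run puppet bootstrap tasks (bootstrap host only)"
done
```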

    Below is a diagram which illustrates this step-by-step deployment workflow:
    TripleO Service configuration workflow

    The most common deployment failures occur during this service configuration phase of deployment, so the remainder of this post will primarily focus on debugging failures of the deployment steps.


    Debugging first steps - what failed?

    Heat Stack create failed.

    Ok something failed during your TripleO deployment, it happens to all of us sometimes!  The next step is to understand the root-cause.

    My starting point after this is always to run:

    openstack stack failures list --long <stackname>

    (undercloud) [stack@undercloud ~]$ openstack stack failures list --long overcloud
    resource_type: OS::Heat::StructuredDeployment
    physical_resource_id: 421c7860-dd7d-47bd-9e12-de0008a4c106
    status: CREATE_FAILED
    status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
    deploy_stdout: |

    PLAY [localhost] ***************************************************************


    TASK [Run puppet host configuration for step 1] ********************************
    ok: [localhost]

    TASK [debug] *******************************************************************
    fatal: [localhost]: FAILED! => {
    "changed": false,
    "failed_when_result": true,
    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
    "Debug: Runtime environment: puppet_version=4.8.2, ruby_version=2.0.0, run_mode=user, default_encoding=UTF-8",
    "Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain"
    to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/8dd0b23a-acb8-4e11-aef7-12ea1d4cf038_playbook.retry

    PLAY RECAP *********************************************************************
    localhost : ok=18 changed=12 unreachable=0 failed=1

    We can tell several things from the output (which has been edited above for brevity), firstly from the name of the failing resource:

    • The error was on one of the Controllers (ControllerDeployment)
    • The deployment failed during the per-step service configuration phase (the AllNodesDeploySteps part tells us this)
    • The failure was during the first step (Step1.0)
    Then we see more clues in the deploy_stdout: ansible failed running the task which runs puppet on the host, so it looks like a problem with the puppet code.

    With a little more digging we can see which node exactly this failure relates to, e.g we copy the SoftwareDeployment ID from the output above, then run:

    (undercloud) [stack@undercloud ~]$ openstack software deployment show 421c7860-dd7d-47bd-9e12-de0008a4c106 --format value --column server_id
    (undercloud) [stack@undercloud ~]$ openstack server list | grep 29b3c254-5270-42ae-8150-9fc3f67d3d89
    | 29b3c254-5270-42ae-8150-9fc3f67d3d89 | overcloud-controller-0 | ACTIVE | ctlplane= | overcloud-full | oooq_control |

    Ok so puppet failed while running via ansible on overcloud-controller-0.


    Debugging via Ansible directly

    Having identified that the problem was during the ansible-driven configuration phase, one option is to re-run the same configuration directly via ansible-playbook, so you can either increase verbosity or potentially modify the tasks to debug the problem.

    Since the Queens release, this is actually very easy, using a combination of the new "openstack overcloud config download" command and the tripleo dynamic ansible inventory.

    (undercloud) [stack@undercloud ~]$ openstack overcloud config download
    The TripleO configuration has been successfully generated into: /home/stack/tripleo-VOVet0-config
    (undercloud) [stack@undercloud ~]$ cd /home/stack/tripleo-VOVet0-config
    (undercloud) [stack@undercloud tripleo-VOVet0-config]$ ls
    common_deploy_steps_tasks.yaml external_post_deploy_steps_tasks.yaml templates
    Compute global_vars.yaml update_steps_playbook.yaml
    Controller group_vars update_steps_tasks.yaml
    deploy_steps_playbook.yaml post_upgrade_steps_playbook.yaml upgrade_steps_playbook.yaml
    external_deploy_steps_tasks.yaml post_upgrade_steps_tasks.yaml upgrade_steps_tasks.yaml

    Here we can see there is a "deploy_steps_playbook.yaml", which is the entry point to run the ansible service configuration steps.  This runs all the common deployment tasks (as outlined above) as well as any service specific tasks (these end up in task include files in the per-role directories, e.g Controller and Compute in this example).

    We can run the playbook again on all nodes with the tripleo-ansible-inventory from tripleo-validations, which is installed by default on the undercloud:

    (undercloud) [stack@undercloud tripleo-VOVet0-config]$ ansible-playbook -i /usr/bin/tripleo-ansible-inventory deploy_steps_playbook.yaml --limit overcloud-controller-0
    TASK [Run puppet host configuration for step 1] ********************************************************************
    ok: []

    TASK [debug] *******************************************************************************************************
    fatal: []: FAILED! => {
    "changed": false,
    "failed_when_result": true,
    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
    "Notice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend",
    "exception: connect failed",
    "Warning: Undefined variable '::deploy_config_name'; ",
    " (file & line not available)",
    "Warning: Undefined variable 'deploy_config_name'; ",
    "Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile
    /base/docker.pp:181:5 on node overcloud-controller-0.localdomain"


    NO MORE HOSTS LEFT *************************************************************************************************
    to retry, use: --limit @/home/stack/tripleo-VOVet0-config/deploy_steps_playbook.retry

    PLAY RECAP ********************************************************************************************************* : ok=56 changed=2 unreachable=0 failed=1

    Here we can see the same error is reproduced directly via ansible, and we made use of the --limit option to only run tasks on the overcloud-controller-0 node.  We could also have added --tags to limit the tasks further (see tripleo-heat-templates for which tags are supported).

    If the error were ansible related, this would be a good way to debug and test any potential fixes to the ansible tasks, and in the upcoming Rocky release there are plans to switch to this model of deployment by default.


    Debugging via Puppet directly

    Since this error seems to be puppet related, the next step is to reproduce it on the host (obviously the steps above often yield enough information to identify the puppet error, but this assumes you need to do more detailed debugging directly via puppet):

    Firstly we log on to the node, and look at the files in the /var/lib/tripleo-config directory.

    (undercloud) [stack@undercloud tripleo-VOVet0-config]$ ssh heat-admin@
    Warning: Permanently added '' (ECDSA) to the list of known hosts.
    Last login: Fri Feb 9 14:30:02 2018 from gateway
    [heat-admin@overcloud-controller-0 ~]$ cd /var/lib/tripleo-config/
    [heat-admin@overcloud-controller-0 tripleo-config]$ ls
    docker-container-startup-config-step_1.json docker-container-startup-config-step_4.json puppet_step_config.pp
    docker-container-startup-config-step_2.json docker-container-startup-config-step_5.json
    docker-container-startup-config-step_3.json docker-container-startup-config-step_6.json

    The puppet_step_config.pp file is the manifest applied by ansible on the baremetal host.

    We can debug any puppet host configuration by running puppet apply manually. Note that hiera is used to control the step value; this will be at the same value as the failing step, but it can also be useful sometimes to manually modify it for development testing of different steps for a particular service.

    [root@overcloud-controller-0 tripleo-config]# hiera -c /etc/puppet/hiera.yaml step
    [root@overcloud-controller-0 tripleo-config]# cat /etc/puppet/hieradata/config_step.json
    {"step": 1}[root@overcloud-controller-0 tripleo-config]# puppet apply --debug puppet_step_config.pp
    Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain

    Here we can see the problem is a typo in the /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp file at line 181, I look at the file, fix the problem (ugeas should be augeas) then re-run puppet apply to confirm the fix.

    Note that with puppet module fixes you will need to get the fix either into an updated overcloud image, or update the module via deploy artifacts for testing local forks of the modules.

    That's all for today, but in a future post, I will cover the new container architecture, and share some debugging approaches I have found helpful when deployment failures are container related.

    by Steve Hardy ( at February 09, 2018 05:04 PM

    February 08, 2018

    Lars Kellogg-Stedman

    Listing iptables rules with line numbers

    You can list iptables rules with rule numbers using the --line-numbers option, but this only works in list (-L) mode. I find it much more convenient to view rules using the output from iptables -S or iptables-save.

    You can augment the output from these commands with rule numbers with the …
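    One way to add per-chain rule numbers to iptables -S / iptables-save output (a sketch of the general idea, not necessarily the approach the original post describes) is a short awk filter:

```shell
# Number each -A (append) rule within its chain; pass other lines through.
number_rules() {
  awk '/^-A/ { print ++n[$2], $0; next } { print }'
}

# Example against canned `iptables -S`-style output
# (normally you would pipe the real command: iptables -S | number_rules):
printf -- '-P INPUT ACCEPT\n-A INPUT -i lo -j ACCEPT\n-A INPUT -p tcp --dport 22 -j ACCEPT\n-A FORWARD -j DROP\n' | number_rules
```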

    by Lars Kellogg-Stedman at February 08, 2018 05:00 AM

    January 31, 2018

    Adam Young

    Matching Create and Teardown in an Ansible Role

    Nothing lasts forever. Except some developer setups that no-one seems to know who owns, and no one is willing to tear down. I’ve tried to build the code to clean up after myself into my provisioning systems. One pattern I’ve noticed is that the same data is required for building and for cleaning up a cluster. When I built Ossipee, each task had both a create and a teardown stage. I want the same from Ansible. Here is how I’ve made it work thus far.

    The main mechanism I use is a conditional include based on a variable set. Here is the tasks/main.yaml file for one of my modules:

    - include_tasks: create.yml
      when: not teardown
    - include_tasks: teardown.yml
      when: teardown

    I have two playbooks which call the same role. The playbooks/create.yml file:

    - hosts: localhost
      vars:
        teardown: false
      roles:
        - provision

    and the playbooks/teardown.yaml file:

    - hosts: localhost
      vars:
        teardown: true
      roles:
        - provision

    All of the real work is done in the tasks/create.yml and tasks/teardown.yml files. For example, I need to create a bunch of Network options in Neutron in a particular (dependency driven) order. Teardown needs to be done in the reverse order. Here is the create fragment for the network pieces:

    - name: int_network
      os_network:
        cloud: "{{ cloudname }}"
        state: present
        name: "{{ netname }}_network"
        external: false
      register: osnetwork
    - os_subnet:
        cloud: "{{ cloudname }}"
        state: present
        network_name: "{{ netname }}_network"
        name: "{{ netname }}_subnet"
    - os_router:
        cloud: "{{ cloudname }}"
        state: present
        name: "{{ netname }}_router"
        interfaces: "{{ netname }}_subnet"
        network: public

    To tear this down, I can reverse the order:

    - os_router:
        cloud: rdusalab
        state: absent
        name: "{{ netname }}_router"
    - os_subnet:
        cloud: rdusalab
        state: absent
        network_name: "{{ netname }}_network"
        name: "{{ netname }}_subnet"
    - os_network:
        cloud: rdusalab
        state: absent
        name: "{{ netname }}_network"
        external: false

    As you can see, the two files share a naming convention: name: “{{ netname }}_network” should really be precalculated in the vars file and then used in both cases. That is a good future improvement.
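    That improvement could look something like this (a sketch only; the file path and variable names are my own, assuming the role is named provision), computing each name once so create.yml and teardown.yml cannot drift apart:

```yaml
# roles/provision/vars/main.yml -- sketch: derive each resource name once
network_name: "{{ netname }}_network"
subnet_name: "{{ netname }}_subnet"
router_name: "{{ netname }}_router"
```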

    You can see the real value when it comes to lists of objects. For example, to create a set of virtual machines:

    - name: create CFME server
      os_server:
        cloud: "{{ cloudname }}"
        state: present
        name: "cfme.{{ clustername }}"
        key_name: ayoung-pubkey
        timeout: 200
        flavor: 2
        boot_volume: "{{ }}"
        security_groups:
          - "{{ securitygroupname }}"
        nics:
          - net-id: "{{ }}"
            net-name: "{{ netname }}_network"
        meta:
          hostname: "{{ netname }}"
      register: cfme_server

    It is easy to reverse this with the list of host names. In teardown.yml:

    - os_server:
        cloud: "{{ cloudname }}"
        state: absent
        name: "cfme.{{ clustername }}"
      with_items: "{{ cluster_hosts  }}"

    To create the set of resources I can run:

    ansible-playbook   playbooks/create.yml 

    and to clean up

    ansible-playbook   playbooks/teardown.yml 

    This pattern scales. If you have three roles that all follow this pattern, they can be run in forward order to set up, and in reverse order to tear down. However, it does tend to work at odds with Ansible’s role dependency mechanism: Ansible does not provide a way to specify that the dependent roles should be run in reverse during the teardown process.

    by Adam Young at January 31, 2018 07:13 PM

    Deploying an image on OpenStack that is bigger than the available flavors.

    Today I tried to use our local OpenStack instance to deploy CloudForms Management Engine (CFME). Our OpenStack deployment has a set of flavors that all are defined with 20 GB Disks. The CFME image is larger than this, and will not deploy on the set of flavors. Here is how I worked around it.

    The idea is that, instead of booting a server on Nova using an image and a flavor, first create a bootable volume, and use that to launch the virtual machine.

    The command line way to create an 80 GB volume would be:

    openstack volume create --image cfme-rhevm- --size 80 bootable_volume

    But as you will see later, I used ansible to create it instead.

    Uploading the image (downloaded from the portal)

    openstack image create --file ~/Downloads/cfme-rhevm- cfme-rhevm-

    Which takes a little while. Once it is done:

    $ openstack image show cfme-rhevm-
    | Field            | Value                                                                           |
    | checksum         | 52c57210cb8dd2df26ff5279a5b0be06                                                |
    | container_format | bare                                                                            |
    | created_at       | 2018-01-30T21:09:20Z                                                            |
    | disk_format      | raw                                                                             |
    | file             | /v2/images/cfcca613-40d9-44c8-b12f-e0ddc93ab914/file                            |
    | id               | cfcca613-40d9-44c8-b12f-e0ddc93ab914                                            |
    | min_disk         | 0                                                                               |
    | min_ram          | 0                                                                               |
    | name             | cfme-rhevm-                                                           |
    | owner            | fc56aad6163c44dc8beb0c287a975ca3                                                |
    | properties       | direct_url='file:///var/lib/glance/images/cfcca613-40d9-44c8-b12f-e0ddc93ab914' |
    | protected        | False                                                                           |
    | schema           | /v2/schemas/image                                                               |
    | size             | 1072365568                                                                      |
    | status           | active                                                                          |
    | tags             |                                                                                 |
    | updated_at       | 2018-01-30T21:35:30Z                                                            |
    | virtual_size     | None                                                                            |
    | visibility       | private                                                                         |

    I used Ansible to create the volume and the server. This is the fragment from my task.yaml file.

    - name: create CFME volume
      os_volume:
        cloud: "{{ cloudname }}"
        image: cfme-rhevm-
        size: 80
        display_name: cfme_volume
      register: cfme_volume
    - name: create CFME server
      os_server:
        cloud: "{{ cloudname }}"
        state: present
        name: "cfme.{{ clustername }}"
        key_name: ayoung-pubkey
        timeout: 200
        flavor: 2
        boot_volume: "{{ }}"
        security_groups:
          - "{{ securitygroupname }}"
        nics:
          - net-id: "{{ }}"
            net-name: "{{ netname }}_network"
        meta:
          hostname: "{{ netname }}"
      register: cfme_server

    The interesting part is the boot_volume: “{{ }} line, which uses the value registered in the volume create step to get the id of the new volume.

    by Adam Young at January 31, 2018 03:44 AM

    Freeing up a Volume from a Nova server that errored

    Trial and error. It’s a key part of getting work done in my field, and I make my share of errors. Today, I tried to create a virtual machine in Nova using a bad glance image that I had converted to a bootable volume.

    The error message was:

     {u'message': u'Build of instance d64fdd07-748c-4e27-b212-59e8cef9d6bf aborted: Block Device Mapping is Invalid.', u'code': 500, u'created': u'2018-01-31T03:10:56Z'}

    The VM could not release the volume.

    $  openstack server remove volume d64fdd07-748c-4e27-b212-59e8cef9d6bf de4909df-e95c-4a54-af5c-c24a26146a89
    Can't detach root device volume (HTTP 403) (Request-ID: req-725ce3fa-36e5-4dd8-b10f-7521c91a5c32)

    So I deleted the instance:

      openstack server delete d64fdd07-748c-4e27-b212-59e8cef9d6bf

    But when I went to list the volumes:

    | ID                                   | Name        | Status | Size | Attached to                                                   |
    | de4909df-e95c-4a54-af5c-c24a26146a89 | xxxx        | in-use |   80 | Attached to d64fdd07-748c-4e27-b212-59e8cef9d6bf on /dev/vda  |
    $ openstack volume delete de4909df-e95c-4a54-af5c-c24a26146a89
    Failed to delete volume with name or ID 'de4909df-e95c-4a54-af5c-c24a26146a89': Invalid volume: Volume status must be available or error or error_restoring or error_extending and must not be migrating, attached, belong to a group or have snapshots. (HTTP 400) (Request-ID: req-f651299d-740c-4ac9-9f52-8a603eace8f6)
    1 of 1 volumes failed to delete.

    To unwedge it I need to run:

    $ cinder reset-state --attach-status detached de4909df-e95c-4a54-af5c-c24a26146a89
    Policy doesn't allow volume_extension:volume_admin_actions:reset_status to be performed. (HTTP 403) (Request-ID: req-8bdff31a-7745-4e5e-a449-a5dac5d87f70)
    ERROR: Unable to reset the state for the specified entity(s).

    So, finally, I had to get an admin account (role admin on any project will work, still…).

    . ~/devel/openstack/salab/rduv3-admin.rc
    cinder  reset-state --attach-status detached de4909df-e95c-4a54-af5c-c24a26146a89

    And now (as my non admin user)

    $ openstack volume list  
    | ID                                   | Name        | Status    | Size | Attached to                                   |
    | de4909df-e95c-4a54-af5c-c24a26146a89 | xxxx        | available |   80 |                                               |
    $ openstack volume delete xxxx
    $ openstack volume list  
    | ID                                   | Name        | Status | Size | Attached to                                   |

    I talked with the Cinder team about the policy for volume_extension:volume_admin_actions:reset_status and they seem to think that it is too unsafe for an average user to be able to perform. Thus, a “force delete” like this would need to be a new operation, or a different flag on an existing operation.

    We’ll work on it.

    by Adam Young at January 31, 2018 03:44 AM

    January 30, 2018

    OpenStack In Production (CERN)

    Keep calm and reboot: Patching recent exploits in a production cloud

    At CERN, we have around 8,500 hypervisors running 36,000 guest virtual machines. These provide the compute resources both for the laboratory's physics program and for the organisation's administrative operations, such as paying bills and reserving rooms at the hostel. These resources are spread over many different server configurations, some of them over 5 years old.

    With the accelerator stopped over the CERN annual closure until mid March, this is a good period to plan reconfiguration of compute resources, such as migrating our central batch system (which schedules the jobs across the central compute resources) to a new system based on HTCondor. The compute resources are heavily used, but there is more flexibility to drain some parts in the quieter periods of the year, when there is not 10PB/month coming from the detectors. However, this year we have had an unexpected additional task: deploying the fixes for the Meltdown and Spectre exploits across the centre.

    The CERN environment is based on Scientific Linux CERN 6 and CentOS 7. The hypervisors are now entirely CentOS 7 based, with guests running a variety of operating systems including Windows flavors and CERNVM. The campaign to upgrade involved a number of steps:
    • Assess the security risk
    • Evaluate the performance impact
    • Test the upgrade procedure and stability
    • Plan the upgrade campaign
    • Communicate with the users
    • Execute the campaign

    Security Risk

    The CERN environment consists of a mixture of different services, with thousands of projects on the cloud, distributed across two data centres in Geneva and Budapest. 

    Two major risks were identified
    • Services which provided the ability for end users to run their own programs along with others sharing the same kernel. Examples of this are the public login services and batch farms. Public login services provide an interactive Linux environment for physicists to log into from around the world, prepare papers, develop and debug applications and submit jobs to the central batch farms. The batch farms themselves provide 1000s of worker nodes processing the data from CERN experiments by farming event after event out to free compute resources. Both of these environments are multi-user and allow end users to compile their own programs, and thus were rated as high risk for the Meltdown exploit.
    • The hypervisors provide support for a variety of different types of virtual machines. Different areas of the cloud provide access to different network domains or to compute optimised configurations. Many of these hypervisors will have VMs owned by different end users and therefore can be exposed to the Spectre exploits, even if the performance is such that exploiting the problem would take significant computing time.
    The remaining VMs are for dedicated services without access for end user applications or dedicated bare metal servers for I/O intensive applications such as databases and disk or tape servers.

    There are a variety of different hypervisor configurations which we split down by processor type (in view of the Spectre microcode patches). Each of these needs independent performance and stability checks.

    Processor name(s)
    E5-2630 v3 @ 2.40GHz,E5-2640 v3 @ 2.60GHz
    E5-2630 v4 @ 2.20GHz, E5-2650 v4 @ 2.20GHz
    E5-2650 v2 @ 2.60GHz
    CPU family: 21 Model: 1 Model name: AMD Opteron(TM) Processor 6276 Stepping: 2
    E5-2630L 0 @ 2.00GHz, E5-2650 0 @ 2.00GHz
    E5645 @ 2.40GHz, L5640 @ 2.27GHz, X5660 @ 2.80GHz

    These risks were explained by the CERN security team to the end users in their regular blogs.

    Evaluating the performance impact

    The High Energy Physics community uses a suite called HEPSPEC06 to benchmark compute resources. These are synthetic programs based on the C++ components of SPEC CPU2006 which match the instruction mix of typical physics programs. With this benchmark, we have started to re-benchmark (the majority of) the CPU models we have in the data centres, both on the physical hosts and on the guests. The measured performance loss across all architectures tested so far is about 2.5% in HEPSPEC06 (a number also confirmed by one of the LHC experiments using their real workloads), with a few cases approaching 7%. So for our physics codes, the effect of patching seems measurable, but much smaller than many expected.

    Test the upgrade procedure and stability

    With our environment based on CentOS and Scientific Linux, the deployment of the updates for Meltdown and Spectre was dependent on the upstream availability of the patches. These could be broken down into several parts:
    • Firmware for the processors - the microcode_ctl packages provide additional patches to protect against some parts of Spectre. This package proved very dynamic, as new processor firmware was being added on a regular basis, and it was not always clear when it needed to be applied: the package version would increase, but this did not always include an update for the particular hardware type in use. Following the Intel release notes, there were entries such as "HSX C0(06-3f-02:6f) 3a->3b", which means that the processor identified by 06-3f-02:6f is upgraded from firmware release 0x3a to 0x3b. The fields are the CPU family, model and stepping from /proc/cpuinfo (in hexadecimal), and the currently loaded firmware level can be read from /sys/devices/system/cpu/cpu0/microcode/version. A simple script was made available to the end users so they could check their systems, and this was also used by the administrators to validate the central IT services.
    • For the operating system, we used a second script derived from the upstream github code. The team maintaining this package were very responsive in incorporating our patches so that other sites could benefit from the combined analysis.
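    The "06-3f-02" style identifiers in the release notes can be reproduced from the decimal family/model/stepping values in /proc/cpuinfo. This is a hypothetical helper, not the actual CERN script (whose link was lost in aggregation):

```python
def intel_ucode_id(family: int, model: int, stepping: int) -> str:
    """Format CPU family/model/stepping (decimal values, as read from
    /proc/cpuinfo) into the hexadecimal identifier used in Intel's
    microcode release notes, e.g. family 6, model 63, stepping 2
    becomes '06-3f-02' (a Haswell-EP part, "HSX")."""
    return "{:02x}-{:02x}-{:02x}".format(family, model, stepping)

# Haswell-EP: family 6, model 63, stepping 2
print(intel_ucode_id(6, 63, 2))  # -> 06-3f-02
```

    Cross-checking a host then amounts to comparing this identifier against the release notes and reading the loaded firmware level from /sys/devices/system/cpu/cpu0/microcode/version.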

    Communication with the users

    For the cloud, there are several resource consumers.
    • IT service administrators who provide higher level functions on top of the CERN cloud. Examples include file transfer services, information systems, web frameworks and experiment workload management systems. While some are in the IT department, others are representatives of their experiments or supporters for online control systems such as those used to manage the accelerator infrastructure.
    • End users consume cloud resources by asking for virtual machines and using them as personal working environments. Typical cases would be a MacOS user who needs a Windows desktop where they would create a Windows VM and use protocols such as RDP to access it when required.
    The communication approach was as follows:
    • A meeting was held to discuss the risks of exploits, the status of the operating systems and the plan for deployment across the production facilities. With a Q&A session, the major concerns raised were around potential impact on performance and tuning options. 
    • An e-mail was sent to all owners of virtual machine resources informing them of the upcoming interventions.
    • CERN management was informed of the risks and the plan for deployment.
    CERN uses ServiceNow to provide a service desk for tickets and a status board of interventions and incidents. A single entry was used to communicate the current plans and status so that all cloud consumers could go to a single place for the latest information.

    Execute the campaign

    With the accelerator starting up again in March and the risk of the exploits, the approach taken was to complete the upgrades to the infrastructure in January, leaving February to find and resolve any residual problems. As the handling of the compute/batch part of the infrastructure was relatively straightforward (with only one service on top), we will focus in the following on the more delicate part: hypervisors running services supporting several thousand users in their daily work.

    The layout of our infrastructure with its availability zones (AVZs) determined the overall structure and timeline of the upgrade. With effectively four AVZs in our data centre in Geneva and two AVZs for our remote resources in Budapest, we scheduled the upgrade for the services part of the resources over four days.

    The main zones in Geneva were done one per day, with a break after the first one (GVA-A) in case there were unexpected difficulties to handle on the infrastructure or on the application side. The remaining zones were scheduled on consecutive days (GVA-B and GVA-C), the smaller ones (critical, WIG-A, WIG-B) in sequential order on the last day. This way we upgraded around 400 hosts with 4,000 guests per day.

    Within each zone, hypervisors were divided into 'reboot groups' which were restarted and checked before the next group was handled. These groups were determined by the OpenStack cells underlying the corresponding AVZs. Since some services needed the window of downtime to be limited, their hosting servers were moved to the special Group 1, the only one for which we could give a precise start time.

    For each group several steps were performed:
    • install all relevant packages
    • check the next kernel is the desired one
    • reset the BMC (needed for some specific hardware to prevent boot problems)
    • log the nova and ping state of all guests
    • stop all alarming 
    • stop nova
    • shut down all instances via virsh
    • reboot the hosts
    • ... wait ... then fix hosts which did not come back
    • check running kernel and vulnerability status on the rebooted hosts
    • check and fix potential issues with the guests
    Shutting down virtual machines via 'virsh', rather than via the OpenStack APIs, was chosen to speed up the overall process -- even if this required switching off nova-compute on the hosts as well (to keep nova in a consistent state). An alternative to issuing 'virsh' commands directly would be to configure 'libvirt-guests', especially in the context of the question of whether guests should be shut down and rebooted (which we did during this campaign) or paused/resumed. This is an option we'll have a look at when preparing for similar campaigns in the future.
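    The grouping logic described above can be sketched in a few lines; hostnames and cell labels are hypothetical, and the real tooling was driven by the OpenStack cell layout:

```python
from collections import defaultdict

def reboot_groups(hosts, priority_hosts=()):
    """Split a zone's hypervisors into reboot groups keyed by their
    OpenStack cell; hosts whose services need a precise start time go
    into the special 'group-1'. Sketch only, not the CERN tooling."""
    groups = defaultdict(list)
    for name, cell in hosts:
        key = "group-1" if name in priority_hosts else cell
        groups[key].append(name)
    return dict(groups)

hosts = [("hv001", "cell-a"), ("hv002", "cell-a"), ("hv003", "cell-b")]
print(reboot_groups(hosts, priority_hosts={"hv003"}))
# -> {'cell-a': ['hv001', 'hv002'], 'group-1': ['hv003']}
```

    Each resulting group is then taken through the install/reboot/verify steps listed above before the next group is touched.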

    As some of the hypervisors in the cloud had very long uptimes, and as this was the first time we systematically rebooted the whole infrastructure since the service went into full production about five years ago, we were not quite sure what kind of issues to expect -- and in particular at which scale. To our relief, the problems encountered hit less than 1% of the servers and included (in descending order of appearance):
    • hosts stuck in shutdown (solved by IPMI reset)
    • libvirtd stuck after reboot (solved by another reboot)
    • hosts without network connectivity (solved by another reboot)
    • hosts stuck in grub during boot (solved by reinstalling grub) 
    On the guest side, virtual machines were mostly OK when the underlying hypervisor was OK as well.
    A few additional cases included:
    • incomplete kernel upgrades, so the root partition could not be found (solved by booting back into an older kernel and reinstalling the desired kernel)
    • file system issues (solved by running file system repairs)
    So, despite initial worries, we hit no major issues when rebooting the whole CERN cloud infrastructure!


    While this kind of security issue does not arise very often, the key parts of the campaign follow standard steps, namely assessing the risk, planning the update, communicating with the user community, executing the campaign and handling incomplete updates.

    Using cloud availability zones to schedule the deployment allowed users to easily understand when there would be an impact on their virtual machines, and it encourages the good practice of load balancing resources.



    • Arne Wiebalck
    • Jan Van Eldik
    • Tim Bell

    by Tim Bell at January 30, 2018 01:31 PM

    January 29, 2018

    Adam Young

    Creating an Ansible Inventory file using Jinja templating

    While there are lots of tools in Ansible for generating an inventory file dynamically, in a system like this, you might want to be able to perform additional operations against the same cluster. For example, once the cluster has been running for a few months, you might want to do a Yum update. Eventually, you want to de-provision. Thus, having a remote record of what machines make up a particular cluster can be very useful. Dynamic inventories can be OK, but often it takes time to regenerate the inventory, and that may slow down an already long process, especially during iterated development.

    So, I like to generate inventory files. These are fairly simple files, but they are not one of the supported file types in Ansible. Ansible does support ini files, but inventory files may have lines that are not in key=value format.

    Instead, I use Jinja formatting to generate inventory files, and they are pretty simple to work with.

    UPDATE: I jumped the gun on the inventory file I was generating. The template and completed inventory have been corrected.

    To create the set of hosts, I use the OpenStack server (os_server) task, like this:

    - name: create servers
      os_server:
        cloud: "{{ cloudname }}"
        state: present
        name: "{{ item }}.{{ clustername }}"
        image: rhel-guest-image-7.4-0
        key_name: ayoung-pubkey
        timeout: 200
        flavor: 2
        security_groups:
          - "{{ securitygroupname }}"
        nics:
          - net-id: "{{ }}"
            net-name: "{{ netname }}_network"
        meta:
          hostname: "{{ netname }}"
      with_items: "{{ cluster_hosts }}"
      register: osservers
    - file:
        path: "{{ config_dir }}"
        state: directory
        mode: 0755
    - file:
        path: "{{ config_dir }}/deployments"
        state: directory
        mode: 0755
    - file:
        path: "{{ cluster_dir }}"
        state: directory
        mode: 0755
    - template:
        src: inventory.ini.j2
        dest: "{{ cluster_dir }}/inventory.ini"
        force: yes
        backup: yes

    A nice thing about this task is that, whether it is creating a new server or not, it produces the same output: a json object that has the server data in an array.

    The following template is my current fragment.

    {% for item in osservers.results %}
    {{ item.server.interface_ip }}
    {% endfor %}
    {% for item in osservers.results %}
    [{{ }}]
    {{ item.server.interface_ip  }}
    {% endfor %}
    {% for item in osservers.results %}
    {% if'idm')  %}
    {{ item.server.interface_ip  }}
    {% endif %}
    {% endfor %}
    ipa_server_password={{ ipa_server_password }}
    ipa_domain={{ clustername }}
    deployment_dir={{ cluster_dir }}
    ipa_realm={{ clustername|upper }}
    ipa_admin_user_password={{  ipa_admin_password }}
    ipa_forwarder={{ ipa_forwarder }}
    lab_nameserver1={{ lab_nameserver1 }}
    lab_nameserver2={{ lab_nameserver2 }}

    I keep the variable definitions in a separate file. This produces an inventory file that looks like this:


    My next step is to create a host group for all of the nodes (node0 node1) based on a shared attribute. I probably will do that by converting the list of hosts to a dictionary keyed by hostname, and have the name of the groups as the value.
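    A minimal sketch of that grouping idea, assuming (hypothetically) that the shared attribute is the hostname's alphabetic prefix, so node0 and node1 land in the same group:

```python
def hosts_to_groups(hostnames):
    """Map each hostname to a group name derived from a shared
    attribute -- here simply the hostname with trailing digits
    stripped, so 'node0' and 'node1' both map to 'node'."""
    return {h: h.rstrip("0123456789") for h in hostnames}

def invert(mapping):
    """Turn the host->group map into group->hosts lists, which is the
    shape needed for the [group] sections of an inventory file."""
    groups = {}
    for host, group in mapping.items():
        groups.setdefault(group, []).append(host)
    return groups

m = hosts_to_groups(["node0", "node1", "idm"])
print(invert(m))  # -> {'node': ['node0', 'node1'], 'idm': ['idm']}
```

    The inverted dictionary can then be fed to the same Jinja template style shown above to emit one section per group.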

    by Adam Young at January 29, 2018 12:31 AM

    January 28, 2018

    Adam Young

    Getting Shade for the Ansible OpenStack modules

    When Monty Taylor and company looked to update the Ansible support for OpenStack, they realized that there was a neat little library waiting to emerge: Shade. Pulling the duplicated code into Shade brought along all of the benefits that a good refactoring can accomplish: fewer cut and paste errors, common things work in common ways, and so on. However, this means that the OpenStack modules are now dependent on a remote library being installed on the managed system. And we do not yet package Shade as part of OSP or the Ansible products. If you do want to use the OpenStack modules for Ansible, here is the “closest to supported” way you can do so.

    The Shade library does not attempt to replace the functionality of the python-*client libraries provided by upstream OpenStack, but instead uses them to do its work. Shade is thus more of a workflow coordinator between the clients. It should not surprise you, then, to find that shade requires such libraries as keystoneauth1 and python-keystoneclient. In an OSP12 deployment, these can be found in the rhel-7-server-openstack-12-rpms repository. Thus, as a prerequisite, you need to have this repository enabled on the host where you plan on running the playbooks. If you are setting up a jumphost for this, that jumphost should be running RHEL 7.3, as that has the appropriate versions of all the other required RPMs as well. I tried this on a RHEL 7.4 system, and it turns out it has too recent a version of python-urllib3.

    Shade has one additional dependency beyond what is provided with OSP: Munch. This is part of Fedora EPEL and can be installed from the provided link. Then, shade can be installed from RDO.

    Let me be clear that these are not supported packages yet. This is just a workaround to get them installed via RPMs. This is a slightly better solution than using PIP to install and manage your Shade deployment, as some others have suggested. It keeps the set of python code you are running tracked via the RPM database. When a supported version of shade is provided, it should replace the version you install from the above links.

    by Adam Young at January 28, 2018 06:55 PM

    January 26, 2018

    Lars Kellogg-Stedman

    Administrivia: Pelican and theme update

    I've just refreshed the version of Pelican used to generate this blog, along with the associated themes and plugins. It all seems to be working, but if you spot a problem feel free to drop me a line.

    by Lars Kellogg-Stedman at January 26, 2018 05:00 AM

    January 25, 2018

    Adam Young

    Using JSON home on a Keystone server

    Say you have an AUTH_URL like this:

    $ echo $OS_AUTH_URL

    And now you want to do something with it.  You might think you can get the info you want from the /v3 url, but it does not tell you much:

    $ curl $OS_AUTH_URL 
    {"version": {"status": "stable", "updated": "2016-10-06T00:00:00Z", "media-types": [{"base": "application/json", "type": "application/vnd.openstack.identity-v3+json"}], "id": "v3.7", "links": [{"href": "", "rel": "self"}]}}[ayoung@ayoung541 salab]$

    Not too helpful.  Turns out, though, that there is data; it just requires the json-home Accept header.

    You access the document like this:

    curl $OS_AUTH_URL -H "Accept: application/json-home"


    I’m not going to paste the output: it is huge.

    Here is how I process it:

    curl $OS_AUTH_URL -H "Accept: application/json-home" | jq '. | .resources '

    This will format the output somewhat legibly.  To get a specific section, say the endpoint list, you can find it in the doc like this:

     "": {
     "href": "/endpoints"

    And to pull it out programmatically:

    curl -s $OS_AUTH_URL -H "Accept: application/json-home" | jq '. | .resources | .[""] | .href'
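    The same lookup can be done without jq. The rel URL key below is a stand-in ("example/rel/endpoints"), since the real key was not preserved in the excerpt above:

```python
import json

# A minimal json-home style document, abridged from the huge real
# output; "example/rel/endpoints" is a placeholder rel URL.
doc = json.loads(
    '{"resources": {"example/rel/endpoints": {"href": "/endpoints"}}}'
)

# Equivalent of the jq pipeline:  . | .resources | .["<rel-url>"] | .href
rel = "example/rel/endpoints"
print(doc["resources"][rel]["href"])  # -> /endpoints
```

    The json-home document is just a dictionary keyed by rel URLs, so any JSON-aware tool can walk it the same way.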

    by Adam Young at January 25, 2018 03:42 PM

    Lars Kellogg-Stedman

    Fun with devicemapper snapshots

    I find myself working with Raspbian disk images fairly often. A typical workflow is:

    • Download the disk image.
    • Mount the filesystem somewhere to check something.
    • Make some changes or install packages just to check something else.
    • Crap, I've made changes… at which point I need to fetch a new copy …

    by Lars Kellogg-Stedman at January 25, 2018 05:00 AM

    January 24, 2018

    RDO Blog

    Overview of a CI/CD workflow with Zuul

    The upcoming version of Zuul has many new features that allow one to create powerful continuous integration and continuous deployment pipelines.

    This article presents some mechanisms to create such pipelines. As a practical example, I demonstrate the Software Factory project development workflow we use to continuously build, test and deliver rpm packages through code review.

    Build job

    The first stage of this workflow is to build a new package for each change.

    Build job definition

    The build job is defined in a zuul.yaml file:

    - job:
        name: sf-rpm-build
        description: Build Software Factory rpm package
        run: playbooks/rpmbuild.yaml
        required-projects:
          - software-factory/sfinfo
        nodeset:
          nodes:
            - name: mock-host
              label: centos-7

    The required-projects option declares the projects that are needed to run the job. In this case, the package metadata, such as the software collection targets, is defined in the sfinfo project. This means that every time this job is executed, the sfinfo project will be copied to the test instance.

    Extra required-projects can be added per project, for example the cauth package requires the cauth-distgit project to build a working package. The cauth pipeline can be defined as:

    - project:
        name: software-factory/cauth
        check:
          jobs:
            - sf-rpm-build:
                required-projects:
                  - software-factory/cauth-distgit

    Most of the job parameters can be modified when added to a project pipeline. In the case of the required-projects the list isn't replaced but extended. This means a change on the cauth project results in the sf-rpm-build job running with the sfinfo and cauth-distgit projects.
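    A sketch of that extend-rather-than-replace semantic (simplified; Zuul's real merge logic lives in its scheduler, and the helper name here is made up):

```python
def effective_required_projects(job_level, pipeline_level):
    """Merge the job's own required-projects with the ones added in the
    project pipeline: the pipeline list extends the job list, preserving
    order and dropping duplicates."""
    seen, merged = set(), []
    for project in list(job_level) + list(pipeline_level):
        if project not in seen:
            seen.add(project)
            merged.append(project)
    return merged

print(effective_required_projects(
    ["software-factory/sfinfo"],
    ["software-factory/cauth-distgit"]))
# -> ['software-factory/sfinfo', 'software-factory/cauth-distgit']
```

    So the sf-rpm-build job on a cauth change ends up with both sfinfo and cauth-distgit in its workspace.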

    Build job playbook

    The build job is an Ansible playbook:

    - hosts: mock-host
      vars:
        # Get sfinfo location
        sfinfo_path_query: "[?name=='software-factory/sfinfo'].src_dir"
        sfinfo_path: >
          {{ (zuul.projects.values() | list | json_query(sfinfo_path_query))[0] }}
        # Get workspace path to run zuul_rpm_* commands
        sfnamespace_path: "{{ sfinfo_path | dirname | dirname }}"
      tasks:
        - name: Copy rpm-gpg keys
          become: yes
          command: "rsync -a {{ sfinfo_path }}/rpm-gpg/ /etc/pki/rpm-gpg/"
        - name: Run
          command: >
                --distro-info ./software-factory/sfinfo/sf-{{ zuul.branch }}.yaml
                {% for item in zuul['items'] %}
                  --project {{ }}
                {% endfor %}
          args:
            chdir: "{{ sfnamespace_path }}"
        - name: Fetch zuul-rpm-build repository
          synchronize:
            src: "{{ sfnamespace_path }}/zuul-rpm-build/"
            dest: "{{ zuul.executor.log_root }}/buildset/"
            mode: pull

    First, the variables use a JMESPath query to discover the location of the sfinfo project on the test instance. Indeed, the Zuul executor prepares the workspace using relative paths constructed from the connection hostname. For reference, the playbook starts with a zuul.projects variable like the one below:

          name: software-factory/sfinfo
          src_dir: src/
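    For illustration, the JMESPath filter `[?name=='software-factory/sfinfo'].src_dir` is equivalent to a plain Python comprehension (sample data adapted from the fragment above; the second entry is made up):

```python
# A list of project dicts, as json_query sees zuul.projects.values()
projects = [
    {"name": "software-factory/sfinfo", "src_dir": "src/"},
    {"name": "software-factory/sf-config", "src_dir": "src/sf-config"},
]

# Equivalent of json_query("[?name=='software-factory/sfinfo'].src_dir"):
# filter on the name field, project out src_dir.
result = [p["src_dir"] for p in projects
          if p["name"] == "software-factory/sfinfo"]
print(result[0])  # -> src/
```

    The playbook then takes the first (and only) match, exactly like the `[0]` index in the vars section above.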

    Then the job runs the package building command using a loop on Zuul items. This enables the cross-repository dependencies feature of Zuul, where this job needs to build all the projects that are added as depends-on. Note that this is automatically done by the "tox" job, see the install_sibling task. For reference, the playbook starts with a zuul.items variable like the one below:

        - branch: master
          project:
            name: scl/zuul-jobs-distgit
        - branch: master
          project:
            name: software-factory/sf-config
        - branch: master
          project:
            name: software-factory/sf-ci

    In this example, the depends-on list includes three changes:

    • Pages roles added to zuul-jobs-distgit,
    • Pages jobs configured in sf-config, and
    • Functional tests added to sf-ci.

    The sf-rpm-build job will build a new package for each of these changes.

    The last task fetches the resulting rpm repository to the job logs. Any jobs, playbooks or tasks can synchronize artifacts to the zuul.executor.log_root directory. Having the packages exported with the job logs is convenient for the end users to easily install the packages built in the CI. Moreover, this will also be used by the integration jobs below.

    Integration pipeline

    The second stage of the workflow is to test the packages built by the sf-rpm-build job.

    Share Zuul artifacts between jobs

    Child jobs can inherit data produced by a parent job by using the zuul_return Ansible module. The buildset-artifacts-location role automatically sets the artifacts job logs url using this task:

    - name: Define buildset artifacts location
      delegate_to: localhost
      zuul_return:
        data:
          buildset_artifacts_url: "{{ zuul_log_url }}/{{ zuul_log_path }}/buildset"

    Software Factory configures this role along with the upload-logs role to transparently define this buildset_artifacts_url variable when there is a buildset directory in the logs.

    Integration pipeline definition

    The integration pipeline is defined in a zuul.yaml file:

    - project-template:
        name: sf-jobs
        check:
          jobs:
            - sf-rpm-build
            - sf-ci-functional-minimal:
                dependencies:
                  - sf-rpm-build
            - sf-ci-upgrade-minimal:
                dependencies:
                  - sf-rpm-build
            - sf-ci-functional-allinone:
                dependencies:
                  - sf-rpm-build
            - sf-ci-upgrade-allinone:
                dependencies:
                  - sf-rpm-build

    The functional and upgrade jobs use the dependencies option to declare that they only run after the rpm-build job has finished. They install the newly built packages using the task below:

    - name: Add CI packages repository
      yum_repository:
        name: "zuul-built"
        baseurl: "{{ buildset_artifacts_url }}"
        gpgcheck: "0"
      become: yes

    Projects definition

    The sfinfo project is a config-project in the Zuul configuration. It enables defining all the projects' jobs without requiring the addition of a zuul.yaml file in each project. Config-projects are allowed to configure foreign projects' jobs, for example:

    - project:
        name: scl/zuul-jobs-distgit
        templates:
          - sf-jobs

    A good design for this workflow defines common jobs in a dedicated repository and the common pipeline definitions in a config-projects. Untrusted-projects can still add local jobs if needed and can even add dependencies to the common pipelines. For example, the cauth project extends the required-projects for the sf-rpm-build.

    Deployment pipeline

    When a change passes the integration tests, the reviewer can approve it to trigger the deployment pipeline. The first thing to understand is how to use secrets in the deployment job.

    Using secrets in jobs

    Zuul can securely manage secrets using public key cryptography. Zuul manages a private key for each project and the user can encrypt secrets with the public key to store them in the repository along with the job. That means encryption is a one-way operation for the user and only the Zuul scheduler can decrypt the secret.

    To create a new secret the user runs the encrypt_secret tool:

    # --infile <zuul-web-url>/keys/<tenant-name> <project-name>
    - secret:
        name: <secret-name>
        data:
          <variable-name>: !encrypted/pkcs1-oaep

    Once a secret is added to a job the playbook will have access to its decrypted content. However, there are a few caveats:

    • The secret and the playbook need to be defined in a single job stored in the same project. Note that this may change in the future.
    • If the secret is defined in an untrusted-project, then the job is automatically converted to post-review. That means jobs using secrets can only run in post, periodic or release pipelines. This prevents speculative job modifications from leaking the secret content.
    • Alternatively, if the secret is defined in a config-project, then the job can be used in any pipeline because config-projects don't allow speculative execution on new patchset.

    Deployment pipeline definition

    In the Software Factory project, the deployment is a koji build and is performed as part of the gate pipeline. That means the change isn't merged if it is not deployed. Another strategy is to deploy in the post pipeline after the change is merged, or in the release pipeline after a tag is submitted.

    The deployment pipeline is defined as below:

    - project-template:
        name: sf-jobs
        gate:
          queue: sf
          jobs:
            - sf-rpm-build
            - sf-ci-functional-minimal:
                dependencies:
                  - sf-rpm-build
            - sf-ci-upgrade-minimal:
                dependencies:
                  - sf-rpm-build
            - sf-ci-functional-allinone:
                dependencies:
                  - sf-rpm-build
            - sf-ci-upgrade-allinone:
                dependencies:
                  - sf-rpm-build
            - sf-rpm-publish:
                dependencies:
                  - sf-ci-functional-minimal
                  - sf-ci-upgrade-minimal
                  - sf-ci-functional-allinone
                  - sf-ci-upgrade-allinone

    The deployment pipeline needs to use the queue option to group all the approved changes in dependent order. When multiple changes are approved in parallel, they will all be tested together before being merged, as if they were submitted with a depends-on relationship.

    The deployment pipeline is similar to the integration pipeline, it just adds a publish job that will only run if all the integration tests succeed. This ensures that changes are consistently tested with the projects' current state before being deployed.

    Deployment job definition

    The job is declared in a zuul.yaml file as below:

    - job:
        name: sf-rpm-publish
        description: Publish Software Factory rpm to koji
        run: playbooks/rpmpublish.yaml
        hold-following-changes: true
        required-projects:
          - software-factory/sfinfo
        secrets:
          - sf_koji_configuration

    This job uses the hold-following-changes setting to ensure that only the top of the gate gets published. If the deployment happens in the post or release pipeline, then this setting can be replaced by a semaphore instead, for example:

    - job:
        name: deployment
        semaphore: production-access
    - semaphore:
        name: production-access
        max: 1

    This prevents concurrency issues when multiple changes are approved in parallel.

    Zuul concepts summary

    This article covered the following concepts:

    • Project types:

      • config-projects: hold deployment secrets and set projects' pipelines.
      • untrusted-projects: the projects being tested and deployed.
    • Playbook variables:

      • zuul.projects: the projects installed on the test instance,
      • zuul.items: the list of changes being tested with depends-on,
      • zuul.executor.log_root: the location of job artifacts, and
      • zuul_return: an Ansible module to share data between jobs.
    • Job options:

      • required-projects: the list of projects to copy on the test instance,
      • dependencies: the list of jobs to wait for,
      • secret: the deployment job's secret,
      • post-review: prevents a job from running speculatively,
      • hold-following-changes: makes dependent pipelines run in serial, and
      • semaphore: prevents concurrent deployment of different changes.
    • Pipeline options:

      • Job settings can be modified per project, and
      • queue makes all the projects depend on each other automatically.


    To experiment with Zuul yourself, follow the deployment guide written by Fabien in this previous article.

    Zuul can be used to effectively manage complex continuous integration and deployment pipelines with powerful cross-repository dependency management.

    This article presented the Software Factory workflow where rpm packages are continuously built, tested and delivered through code review. A similar workflow can be created for other types of projects such as golang or container-based software.

    by tristanC at January 24, 2018 01:19 PM

    Lars Kellogg-Stedman

    Safely restarting an OpenStack server with Ansible

    The other day on #ansible, someone was looking for a way to safely shut down a Nova server, wait for it to stop, and then start it up again using the openstack cli. The first part seemed easy:

    - hosts: myserver
        - name: shut down the server
          command: poweroff
          become: true …

    by Lars Kellogg-Stedman at January 24, 2018 05:00 AM

    January 18, 2018

    RDO Blog

    RDO's infrastructure server metrics are now available

    Reposted from post by David Moreau Simard

    We have historically been monitoring RDO's infrastructure through Sensu and it has served us well to pre-emptively detect issues and maximize our uptime.

    At some point, Software Factory grew an implementation of Grafana, InfluxDB and Telegraf in order to monitor the health of the servers, not unlike how upstream's openstack-infra leverages cacti. This implementation was meant to eventually host graphs such as the ones for Zuul and Nodepool upstream.

    While there are still details to be ironed out for the Zuul and Nodepool data collection, there was nothing preventing us from just deploying telegraf everywhere just for the general server metrics. It's one standalone package and one configuration file, that's it.

    Originally, we had been thinking about feeding the Sensu metric data to Influxdb … but why even bother if it's there for free in Software Factory ? So here we are.

    The metrics are now available here. We will use this as a foundation to improve visibility into RDO's infrastructure and make it more "open" and accessible in the future.

    We're not getting rid of Sensu, although we may narrow its scope to keep some of the more complex service and miscellaneous monitoring that we need to be doing. We'll see what time has in store for us.

    Let me know if you have any questions !

    by Rich Bowen at January 18, 2018 09:12 PM

    January 11, 2018

    RDO Blog

    Summary of rdopkg development in 2017

    During the year of 2017, 10 contributors managed to merge 146 commits into rdopkg.

    3771 lines of code were added and 1975 lines deleted across 107 files.

    54 unit tests were added on top of the existing 32 tests - an increase of 169% to a total of 86 unit tests.

    33 scenarios for 5 core rdopkg features were added in new feature tests spanning a total of 228 test steps.

    3 minor releases increased version from 0.42 to 0.45.0.
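    The test growth figure above works out as stated:

```python
# 54 tests added on top of the existing 32
before, added = 32, 54
total = before + added
increase_pct = round(added / before * 100)
print(total, increase_pct)  # -> 86 169
```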

    Let's talk about the most significant improvements.


    rdopkg started as a developers' tool, basically a central repository to accumulate RPM packaging automation in a reusable manner. Quickly adding new features was easy, but making sure existing functionality kept working consistently as code was added and changed proved to be a much greater challenge.

    As rdopkg started shifting from a developers' power tool to a module used in other automation systems, inevitable breakages started to become a problem and prompted me to adapt development accordingly. As a first step, I tried to practice Test-Driven Development (TDD), as opposed to writing tests after a breakage to prevent a specific case from recurring. Unit tests helped discover and prevent various bugs introduced by new code, but testing complex behaviors was a frustrating experience where most of the development time was spent writing unit tests for cases they weren't well suited to cover.

    Sounds like using the wrong tool for the job, right? And so I opened a rather urgent rdopkg RFE: test actions in a way that doesn't suck, and started researching what the cool kids use to develop and test python software without suffering.

    Behavior-Driven Development

    It would seem that cucumber started quite a revolution of Behavior-Driven Development (BDD), and I really like Gherkin, the Business Readable, Domain Specific Language that lets you describe software's behaviour without detailing how that behaviour is implemented. Gherkin serves two purposes: documentation and automated tests.

    After some more research on python BDD tools, I liked behave's implementation, documentation and community the most, so I integrated it into rdopkg and started using feature tests. They make it easy to describe and define expected behavior before writing code. New features now start with a feature scenario which can be reviewed before writing any code. Covering existing behavior with feature tests helps ensure it is both preserved and well defined/explained/documented. Big thanks go to Jon Schlueter, who contributed a huge number of initial feature tests for core rdopkg features.

    Here is an example of rdopkg fix scenario:

        Scenario: rdopkg fix
            Given a distgit
            When I run rdopkg fix
            When I add description to .spec changelog
            When I run rdopkg --continue
            Then spec file contains new changelog entry with 1 lines
            Then new commit was created
            Then rdopkg state file is not present
            Then last commit message is:
                - Description of a change
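    Under the hood, behave maps each Gherkin step to a Python function registered through a decorator. The following is a toy re-implementation of that dispatch mechanism (not the behave library itself; the step bodies are fakes), just to show how step text resolves to code:

    ```python
    STEPS = {}

    def step(text):
        """Register a function as the implementation of a Gherkin step."""
        def register(func):
            STEPS[text] = func
            return func
        return register

    @step("I run rdopkg fix")
    def run_fix(ctx):
        ctx["state"] = "fix started"  # a real suite would invoke rdopkg here

    @step("new commit was created")
    def check_commit(ctx):
        # a real suite would inspect the git repo; here we just check fake state
        assert ctx["state"] == "fix started"
        ctx["checked"] = True

    def run_scenario(lines, ctx):
        for line in lines:
            # drop the Given/When/Then keyword and dispatch on the step text
            _keyword, _, text = line.strip().partition(" ")
            STEPS[text](ctx)
        return ctx

    ctx = run_scenario(["When I run rdopkg fix", "Then new commit was created"], {})
    print(ctx["checked"])  # True
    ```

    behave does the same job with `@given`/`@when`/`@then` decorators and pattern matching on the step text.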

    Proper CI/gating

    Thanks to Software Factory, zuul and gerrit, every rdopkg change now needs to pass the following automatic gate tests before it can be merged:

    • unit tests (python 2, python 3, Fedora, EPEL, CentOS)
    • feature tests (python 2, python 3, Fedora, EPEL, CentOS)
    • integration tests
    • code style check

    In other words, master is now significantly harder to break!

    Tests are managed as individual tox targets for convenience.

    Paying back the Technical Debt

    I tried to write rdopkg code with reusability and future extension in mind, yet at one point of development, with a big influx of new features and modifications, rdopkg approached a critical mass of technical debt where it got into a spiral of new functionality breaking existing functionality, and with each fix two new bugs surfaced. This kept happening, so I stopped adding new stuff and focused on ensuring rdopkg keeps doing what people use it for before extending (breaking) it further. This required quite a few core code refactors, proper integration of features that were hacked in on the clock, as well as leveraging new tools like the Software Factory CI pipeline and behave, described above. But I think it was a success: rdopkg paid its technical debt in 2017 and is ready to face whatever the community throws at it in the near and far future.


    Join Software Factory project

    rdopkg became a part of Software Factory project and found a new home alongside DLRN.

    Software Factory is an open source, software development forge with an emphasis on collaboration and ensuring code quality through Continuous Integration (CI). It is inspired by OpenStack's development workflow that has proven to be reliable for fast-changing, interdependent projects driven by large communities. Read more in Introducing Software Factory.

    Specifically, rdopkg leverages following Software Factory features:

    The rdopkg repo is still mirrored to github, and bugs are kept in the Issues tracker there as well, because github is an accessible, public open space.

    Did I mention you can log in to Software Factory using a github account?

    Finally, big thanks to Javier Peña, who paved the way towards Software Factory with DLRN.

    Continuous Integration

    rdopkg has been using human code reviews for quite some time, and it proved very useful, even though I often +2/+1 my own reviews due to a lack of reviewers. However, people inevitably make mistakes. There are decent unit and feature tests now to detect mistakes, so we fight human error with computing power and automation.

    Each review, and thus each code change to rdopkg, is gated: all unit tests, feature tests, integration tests and code style checks need to pass before human reviewers consider accepting the change.

    Instead of setting up machines and testing environments, installing requirements and waiting for tests to pass, this boring process is now automated on supported distributions and humans can focus on the changes themselves.

    Integration with Fedora, EPEL and CentOS

    rdopkg is now finally available directly from the Fedora/EPEL repositories, so install instructions on Fedora 25+ systems boil down to:

    dnf install rdopkg

    On CentOS 7+, EPEL is needed:

    yum install epel-release
    yum install rdopkg

    Fun fact: to update Fedora rdopkg package, I use rdopkg:

    fedpkg clone rdopkg
    cd rdopkg
    rdopkg new-version -bN
    fedpkg mockbuild
    # testing
    fedpkg push
    fedpkg build
    fedpkg update

    So rdopkg is officially packaging itself while also being packaged by itself.

    Please nuke the jruzicka/rdopkg copr if you were using it previously; it is now obsolete.


    The rdopkg documentation was cleaned up, proof-read, extended with more details and updated with the latest information and links.

    Feature scenarios are now available as man pages thanks to mhu.

    Packaging and Distribution

    Python 3 compatibility

    By popular demand, rdopkg now supports Python 3. There are Python 3 unit tests and python3-rdopkg RPM package.

    Adopt pbr for Versioning

    Most of the initial patches rdopkg was handling in the very beginning were related to distutils and pbr, the OpenStack packaging meta-library; specifically, making it work on a distribution with integrated package management and old, conservative packages.

    Amusingly, pbr was integrated into rdopkg (well, it actually does solve some problems aside from creating new ones), and in order to release the new rdopkg version with pbr on CentOS/EPEL 7, I had to disable hardcoded pbr>=2.1.0 checks on update of python-pymod2pkg, because an older version of pbr is available from EPEL 7. I removed the check (in two different places) as I did so many times before, and it works fine.

    As a tribute to all the fun I had with pbr and distutils, here is a link to my first nuke bogus requirements patch of 2018.

    Aside from being consistent with OpenStack-related projects, rdopkg adopted the strict semantic versioning that pbr uses, which means that releases are always going to have 3 version numbers from now on:

    0.45 -> 0.45.0
    1.0  -> 1.0.0
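    The two-part to three-part mapping is mechanical; here is a tiny illustrative helper (not rdopkg's code, just a sketch of the rule):

    ```python
    def to_semver(version):
        """Pad a dotted version string to exactly three components,
        as in 0.45 -> 0.45.0 or 1.0 -> 1.0.0."""
        parts = version.split(".")
        while len(parts) < 3:
            parts.append("0")
        return ".".join(parts)

    print(to_semver("0.45"))  # 0.45.0
    print(to_semver("1.0"))   # 1.0.0
    ```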

    And More!

    Aside from the big changes mentioned above, a large number of new feature tests and numerous not-so-exciting fixes, here is a list of changes that might be worth mentioning:

    • unify rdopkg patch and rdopkg update-patches and use alias
    • rdopkg pkgenv shows more information and better color coding for easy telling of a distgit state and branches setup
    • preserve Change-Id when amending a commit
    • allow fully unattended runs of core actions.
    • commit messages created by all rdopkg actions are now clearer, more consistent and can be overridden using -H/--commit-header-file
    • better error messages on missing patches in all actions
    • git config can be used to override patches remote, branch, user name and email
    • improved handling of patches_base and patches_ignore including tests
    • improved handling of %changelog
    • improved new/old patches detection
    • improved packaging as suggested in Fedora review
    • improved naming in git and specfile modules
    • properly handle state files
    • linting cleanup and better code style checks
    • python 3 support
    • improve unicode support
    • handle VX.Y.Z tags
    • split bloated utils.cmd into utils.git module
    • merge legacy rdopkg.utils.exception so there is only single module for exceptions now
    • refactor unreasonable default atomic=False affecting action definitions
    • remove legacy rdopkg coprbuild action

    Thank you, rdopkg community!

    January 11, 2018 06:23 PM

    January 10, 2018

    RDO Blog

    RDO Community Blogposts

    If you've missed out on some of the great RDO Community content over the past few weeks while you were on holiday, not to worry. I've gathered the recent blogposts right here for you. Without further ado…

    New TripleO quickstart cheatsheet by Carlos Camacho

    I have created some cheatsheets for people starting to work on TripleO, mostly to help them to bootstrap a development environment as soon as possible.


    Using Ansible for Fernet Key Rotation on Red Hat OpenStack Platform 11 by Ken Savich, Senior OpenStack Solution Architect

    In our first blog post on the topic of Fernet tokens, we explored what they are and why you should think about enabling them in your OpenStack cloud. In our second post, we looked at the method for enabling these.


    Automating Undercloud backups and a Mistral introduction for creating workbooks, workflows and actions by Carlos Camacho

    The goal of this developer documentation is to address the automated process of backing up a TripleO Undercloud and to give developers a complete description about how to integrate Mistral workbooks, workflows and actions to the Python TripleO client.


    Know of other bloggers that we should be including in these round-ups? Point us to the articles on Twitter or IRC and we'll get them added to our regular cadence.

    by Mary Thengvall at January 10, 2018 02:34 AM

    January 05, 2018

    Carlos Camacho

    New TripleO quickstart cheatsheet

    I have created some cheatsheets for people starting to work on TripleO, mostly to help them to bootstrap a development environment as quickly as possible.

    The previous versions of this cheatsheet series were used in several community conferences (FOSDEM among others); they are now deprecated, as the way TripleO is deployed has changed considerably over the last months.

    Here you have the latest version:

    The source code of these bookmarks is available as usual on GitHub

    And this is the code if you want to execute it directly:

    # 01 - Create the toor user.
    sudo useradd toor
    echo "toor:toor" | chpasswd
    echo "toor ALL=(root) NOPASSWD:ALL" \
      | sudo tee -a /etc/sudoers.d/toor
    sudo chmod 0440 /etc/sudoers.d/toor
    su - toor
    # 02 - Prepare the hypervisor node.
    mkdir .ssh
    ssh-keygen -t rsa -N "" -f .ssh/id_rsa
    cat .ssh/ >> .ssh/authorized_keys
    sudo bash -c "cat .ssh/ \
      >> /root/.ssh/authorized_keys"
    sudo bash -c "echo '' \
      >> /etc/hosts"
    export VIRTHOST=
    sudo yum groupinstall "Virtualization Host" -y
    sudo yum install git lvm2 lvm2-devel -y
    ssh root@$VIRTHOST uname -a
    # 03 - Clone repos and install deps.
    git clone \
    chmod u+x ./tripleo-quickstart/
    bash ./tripleo-quickstart/
    sudo setenforce 0
    # 04 - Configure the TripleO deployment with Docker and HA.
    export CONFIG=~/deploy-config.yaml
    cat > $CONFIG << EOF
    overcloud_nodes:
      - name: control_0
        flavor: control
        virtualbmc_port: 6230
      - name: compute_0
        flavor: compute
        virtualbmc_port: 6231
    node_count: 2
    containerized_overcloud: true
    delete_docker_cache: true
    enable_pacemaker: true
    run_tempest: false
    extra_args: >-
      --libvirt-type qemu
      -e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml
      -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml
    EOF
    # 05 - Deploy TripleO.
    export VIRTHOST=
    bash ./tripleo-quickstart/ \
          --clean          \
          --release master \
          --teardown all   \
          --tags all       \
          -e @$CONFIG

    Happy TripleOing!!!

    by Carlos Camacho at January 05, 2018 12:00 AM

    December 21, 2017

    Red Hat Stack

    Using Ansible for Fernet Key Rotation on Red Hat OpenStack Platform 11

    In our first blog post on the topic of Fernet tokens, we explored what they are and why you should think about enabling them in your OpenStack cloud. In our second post, we looked at the method for enabling these

    Fernet tokens in Keystone are fantastic. Enabling these, instead of UUID or PKI tokens, really does make a difference in your cloud’s performance and overall ease of management. I get asked a lot about how to manage keys on your controller cluster when using Fernet. As you may imagine, this could potentially take your cloud down if you do it wrong. Let’s review what Fernet keys are, as well as how to manage them in your Red Hat OpenStack Platform cloud.

    Photo by Freddy Marschall on Unsplash


    • A Red Hat OpenStack Platform 11 director-based deployment
    • One or more controller nodes
    • Git command-line client

    What are Fernet Keys?

    Fernet keys are used to encrypt and decrypt Fernet tokens in OpenStack’s Keystone API. These keys are stored on each controller node, and must be available to authenticate and validate users of the various OpenStack components in your cloud.

    Any given implementation of keystone can have n keys, based on the max_active_keys setting in /etc/keystone/keystone.conf. This number includes all of the key types listed below.

    There are essentially three types of keys:


    Primary keys are used for token generation and validation. You can think of this as the active key in your cloud. Any time a user authenticates, or is validated by an OpenStack API, these are the keys that will be used. There can only be one primary key, and it must exist on all nodes (usually controllers) that are running the keystone API. The primary key is always the highest indexed key.


    Secondary keys are only used for token validation. These keys are rotated out of primary status, and thus are used to validate tokens that may exist after a new primary key has been created. There can be multiple secondary keys, the oldest of which will be deleted based on your max_active_keys setting after each key rotation.


    Staged keys are always the lowest indexed keys (0). Whenever keys are rotated, this key is promoted to a primary key at the highest index allowable by max_active_keys. These keys exist to allow you to copy them to all nodes in your cluster before they're promoted to primary status. This avoids the potential issue where keystone fails to validate a token because the key used to encrypt it does not yet exist in /etc/keystone/fernet-keys.

    The following example shows the keys that you’d see in /etc/keystone/fernet-keys, with max_active_keys set to 4.

    0 (staged: the next primary key)
    1 (primary: token generation & validation)

    Upon performing a key rotation, our staged key (0) will become the new primary key (2), while our old primary key (1) will be moved to secondary status (1).

    0 (staged: the next primary key)
    1 (secondary: token validation)
    2 (primary: token generation & validation)

    We have three keys here, so yet another key rotation will produce the following result:

    0 (staged: the next primary key)
    1 (secondary: token validation)
    2 (secondary: token validation)
    3 (primary: token generation & validation)

    Our staged key (0), now becomes our primary key (3). Our old primary key (2), now becomes a secondary key (2), and (1) remains a secondary key.

    We now have four keys, the number we've set in max_active_keys. One final rotation would produce the following:

    0 (staged: the next primary key)
    1 (deleted)
    2 (secondary: token validation)
    3 (secondary: token validation)
    4 (primary: token generation & validation)

    Our oldest key, secondary (1), is deleted. Our previously staged key (0) is moved to primary (4) status. A new staged key (0) is created. And finally, our old primary key (3) is moved to secondary status.

    If you haven't noticed by now: rotating keys will always remove the key with the lowest index, excluding 0, up to your max_active_keys. Additionally, be careful to set max_active_keys to something that makes sense given your token lifetime and how often you plan to rotate your keys.
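    The promotion and deletion rules described above can be simulated in a few lines of Python. This is an illustration of the bookkeeping only, not keystone's implementation:

    ```python
    def rotate(keys, max_active_keys):
        """One Fernet key rotation, following the rules described above.

        keys maps index -> role ('staged', 'primary' or 'secondary').
        """
        new_primary = max(keys) + 1  # staged key 0 is promoted to this index
        rotated = {0: "staged", new_primary: "primary"}
        for idx in keys:
            if idx != 0:  # former primary and secondaries become/stay secondary
                rotated[idx] = "secondary"
        while len(rotated) > max_active_keys:  # expire the oldest secondary
            del rotated[min(i for i in rotated if i != 0)]
        return rotated

    # Replay the walkthrough: start with staged + primary, rotate three times
    # with max_active_keys = 4.
    keys = {0: "staged", 1: "primary"}
    for _ in range(3):
        keys = rotate(keys, max_active_keys=4)
    print(sorted(keys.items()))
    # [(0, 'staged'), (2, 'secondary'), (3, 'secondary'), (4, 'primary')]
    ```

    The final state matches the last listing above: key 1 has been deleted, 2 and 3 are secondary, and 4 is primary.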

    When to rotate?

    Photo by Uroš Jovičić on Unsplash

    The answer to this question would probably be different for most organizations. My take on this is simply: if you can do it safely, why not automate it and do it on a regular basis? Your threat model and use case would normally dictate this, or you may need to adhere to certain encryption and key management security controls in a given compliance framework. Whatever the case, I think of regular key rotation as a best-practices security measure. You always want to limit the amount of sensitive data, in this case Fernet tokens, encrypted with a single version of any given encryption key. Rotating your keys on a regular basis creates a smaller exposure surface for your cloud and your users.

    How many keys do you need active at one time? This all depends on how often you plan to rotate them, as well as how long your token lifetime is. The answer to this can be expressed in the following equation:

    fernet-keys = token-validity(hours) / rotation-time(hours) + 2

    Let's use an example of rotation every 8 hours, with a default token lifetime of 24 hours. This would be:

    24 hours / 8 hours + 2 = 5
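    As a sanity check, the same formula in code (the helper name is mine, for illustration):

    ```python
    def fernet_keys_needed(token_validity_hours, rotation_hours):
        # fernet-keys = token-validity(hours) / rotation-time(hours) + 2
        return token_validity_hours // rotation_hours + 2

    print(fernet_keys_needed(24, 8))  # 5
    ```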

    Five keys on your controllers would ensure that you always had an active set of keys for your cloud. With this in mind, let's look at a way to rotate your keys using Ansible.

    Rotating Fernet keys

    So you may be wondering: how does one automate this process? You can imagine that doing it by hand would be painful and prone to error. While you could use the fernet_rotate command to do this on each node manually, why would you?

    Let’s look at how to do this with Ansible, Red Hat’s awesome tool for automation. If you’re new to Ansible, please do yourself a favor and check out this quick-start video.

    We'll be using an Ansible role created by my fellow Red Hatter Juan Antonio Osorio (Ozz), one of the coolest guys I know. This is just one way of doing this; for a Red Hat OpenStack Platform install you should contact Red Hat support to review your options and support implications. And of course, your results may vary, so be sure to test on a non-production install!

    Let's start by logging into your Red Hat OpenStack director node as the stack user, placing Ozz's role in a roles directory under /home/stack, and creating a small playbook that applies it:

    $ cat << EOF > ~/rotate.yml
    - hosts: controller
      become: true
        - tripleo-fernet-keys-rotation

    We need to source our stackrc, as we'll be operating on our controller nodes in the next step:

    $ source ~/stackrc

    Using a dynamic inventory from /usr/bin/tripleo-ansible-inventory, we'll run this playbook and rotate the keys on our controllers:

    $ ansible-playbook -i /usr/bin/tripleo-ansible-inventory rotate.yml

    Ansible Role Analysis

    What happened? Looking at Ansible’s output, you’ll note that several tasks were performed. If you’d like to see these tasks, look no further than /home/stack/roles/tripleo-fernet-keys-rotation/tasks/main.yml:

    This task runs a python script from the role's files directory that creates a new fernet key (the script is referenced below through a hypothetical variable):

    - name: Generate new key
      script: "{{ fernet_keygen_script }}"  # hypothetical variable pointing at the role's key-generation script
      register: new_key_register
      run_once: true

    This task will take the output of the previous task, from stdout, and register it as the new_key fact:

    - name: Set new key fact
        new_key: "{{ new_key_register.stdout }}"

    Next, we get a sorted list of the keys that currently exist in /etc/keystone/fernet-keys:

    - name: Get current primary key index
     shell: ls /etc/keystone/fernet-keys | sort -r | head -1
     register: current_key_index_register

    Let's set the next primary key index:

    - name: Set next key index fact
        next_key_index: "{{ current_key_index_register.stdout|int + 1 }}"
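    A caveat worth keeping in mind about the `ls | sort -r | head -1` one-liner above, shown here in Python for illustration: a plain reverse sort is lexicographic, which misorders key names once indexes reach double digits.

    ```python
    # Key files are named "0", "1", ..., and the primary is the highest index.
    names = ["0", "9", "10"]

    print(sorted(names, reverse=True)[0])  # '9'  (lexicographic: wrong past index 9)
    print(max(names, key=int))             # '10' (numeric: correct)

    next_index = int(max(names, key=int)) + 1
    print(next_index)  # 11
    ```

    With a small max_active_keys the indexes stay in single digits for a long time, so the shell version works in practice; a numeric sort (e.g. `sort -rn`) is the safer choice.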

    Now we'll move the staged key to the new primary key index:

    - name: Move staged key to new index
     command: mv /etc/keystone/fernet-keys/0 /etc/keystone/fernet-keys/{{ next_key_index }}

    Next, let's set our new_key as the new staged key:

    - name: Set new key as staged key
        content: "{{ new_key }}"
        dest: /etc/keystone/fernet-keys/0
        owner: keystone
        group: keystone
        mode: 0600

    Finally, we'll reload (not restart) httpd on the controller, allowing keystone to load the new keys:

    - name: Reload httpd
        name: httpd
        state: reloaded


    Now that we have a way to automate rotation of our keys, it’s time to schedule this automation. There are several ways you could do this:


    Cron

    You could, but why?

    Systemd Realtime Timers

    Let’s create the systemd service that will run our playbook:

    cat << EOF > /etc/systemd/system/fernet-rotate.service
    Description=Run an Ansible playbook to rotate fernet keys on the overcloud

    ExecStart=/usr/bin/ansible-playbook \
      -i /usr/bin/tripleo-ansible-inventory /home/stack/rotate.yml

    Now we’ll create a timer with the same name, only with .timer as the suffix, in /etc/systemd/system on the director node:

    cat << EOF > /etc/systemd/system/fernet-rotate.timer
    Description=Timer to rotate our Overcloud Fernet Keys weekly



    Ansible Tower

    I like how you're thinking! But that's a topic for another day.

    Red Hat OpenStack Platform 12

    Red Hat OpenStack Platform 12 provides support for key rotation via Mistral. Learn all about Red Hat OpenStack Platform 12 here.

    What about logging?

    Ansible to the rescue!

    Ansible will use the log_path configuration option from /etc/ansible/ansible.cfg, ansible.cfg in the directory of the playbook, or $HOME/.ansible.cfg. You just need to set this and forget it.

    So let’s enable this service and timer, and we’re off to the races:

    $ sudo systemctl enable fernet-rotate.service
    $ sudo systemctl enable fernet-rotate.timer

    Credit: Many thanks to Lance Bragstad and Dolph Mathews for the key rotation methodology.

    by Ken Savich, Senior OpenStack Solution Architect at December 21, 2017 02:08 AM

    December 18, 2017

    Carlos Camacho

    Automating Undercloud backups and a Mistral introduction for creating workbooks, workflows and actions

    The goal of this developer documentation is to address the automated process of backing up a TripleO Undercloud and to give developers a complete description about how to integrate Mistral workbooks, workflows and actions into the Python TripleO client.

    This tutorial will be divided into several sections:

    1. Introduction and prerequisites
    2. Undercloud backups
    3. Creating a new OpenStack CLI command in python-tripleoclient (openstack undercloud backup).
    4. Creating Mistral workflows for the new python-tripleoclient CLI command.
    5. Give support for new Mistral environment variables when installing the undercloud.
    6. Show how to test locally the changes in python-tripleoclient and tripleo-common.
    7. Give elevated privileges to specific Mistral actions that need to run with elevated privileges.
    8. Debugging actions
    9. Unit tests
    10. Why are all the previous sections related to upgrades?

    1. Introduction and prerequisites

    Let's assume you have a healthy, properly working TripleO development environment. All the commands and customizations we are going to run will be executed on the Undercloud, as usual logged in as the stack user with the stackrc file sourced.

    Then let’s proceed by cloning the repositories we are going to work with in a temporary folder:

    mkdir dev-docs
    cd dev-docs
    git clone
    git clone
    git clone
    • python-tripleoclient: Will define the OpenStack CLI commands.
    • tripleo-common: Will have the Mistral logic.
    • instack-undercloud: Allows updating and creating Mistral environments to store configuration details needed when executing Mistral workflows.

    2. Undercloud backups

    Most of the Undercloud backup procedure is documented on the official TripleO documentation site.

    We will focus on automating the backup of the resources required to restore the Undercloud in case of a failed upgrade:

    • All MariaDB databases on the undercloud node
    • MariaDB configuration file on undercloud (so we can restore databases accurately)
    • All glance image data in /var/lib/glance/images
    • All swift data in /srv/node
    • All data in the stack user's home directory

    To do this we need to be able to:

    • Connect to the database server as root.
    • Dump all databases to file.
    • Create a filesystem backup of several folders (and be able to access folders with restricted access).
    • Upload this backup to a swift container to be able to get it from the TripleO web UI.
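    The filesystem part of that list can be sketched with nothing but the standard library. This is an illustration of the archive step only (the database dump and swift upload are omitted, and the paths are throwaway stand-ins for /var/lib/glance/images, /srv/node and the stack user's home), not the tripleo-common implementation:

    ```python
    import tarfile
    import tempfile
    from pathlib import Path

    def backup_paths(paths, dest):
        """Archive the given directories into one gzipped tarball."""
        with, "w:gz") as tar:
            for path in paths:
                tar.add(path, arcname=Path(path).name)
        return dest

    # Demo with throwaway paths standing in for the real backup sources.
    tmp = Path(tempfile.mkdtemp())
    src = tmp / "images"
    (src / "demo.img").write_text("fake image data")
    archive = backup_paths([src], tmp / "undercloud-backup.tar.gz")
    print(archive.exists())  # True
    ```

    The real Mistral actions additionally run the dump as root (restricted folders) and push the resulting tarball to a swift container.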

    3. Creating a new OpenStack CLI command in python-tripleoclient (openstack undercloud backup).

    The first step is to create a new CLI command for the OpenStack client. In this case, we are going to implement the openstack undercloud backup command.

    cd dev-docs
    cd python-tripleoclient

    Let’s list the files inside this folder:

    [stack@undercloud python-tripleoclient]$ ls
    AUTHORS           doc                  
    babel.cfg         LICENSE                        test-requirements.txt
    bindep.txt        zuul.d                         tools
    build             README.rst                     tox.ini
    ChangeLog         releasenotes                   tripleoclient
    config-generator  requirements.txt               
    CONTRIBUTING.rst  setup.cfg

    Once inside the python-tripleoclient folder we need to check the following file:

    setup.cfg: This file defines all the CLI commands for the Python TripleO client. Specifically, we will need our new command definition at the end of this file:

    undercloud_backup = tripleoclient.v1.undercloud_backup:BackupUndercloud

    This means that we have a new command defined as undercloud backup that will instantiate the BackupUndercloud class defined in the file tripleoclient/v1/

    For further details related to this class definition please go to the gerrit review.

    Now, having our class defined we can call other methods to invoke Mistral in this way:

    clients =
    files_to_backup = ','.join(list(set(parsed_args.add_files_to_backup)))

    workflow_input = {
        "sources_path": files_to_backup

    output = undercloud_backup.prepare(clients, workflow_input)

    We then call the undercloud_backup.prepare method, defined in the file tripleoclient/workflows/, which will call the Mistral workflow:

    def prepare(clients, workflow_input):
        workflow_client = clients.workflow_engine
        tripleoclients = clients.tripleoclient
        with tripleoclients.messaging_websocket() as ws:
            execution = base.start_workflow(
                'tripleo.undercloud_backup.v1.prepare_environment',
            for payload in base.wait_for_messages(workflow_client, ws, execution):
                if 'message' in payload:
                    return payload['message']

    In this case, we loop within the tripleoclient and wait until we receive a message from the Mistral workflow tripleo.undercloud_backup.v1.prepare_environment indicating whether the invoked workflow ended correctly.

    4. Creating Mistral workflows for the new python-tripleoclient CLI command.

    The next step is to define the tripleo.undercloud_backup.v1.prepare_environment Mistral workflow. All the Mistral workbooks, workflows and actions are defined in the tripleo-common repository.

    Let’s go inside tripleo-common

    cd dev-docs
    cd tripleo-common

    And see its content:

    [stack@undercloud tripleo-common]$ ls
    AUTHORS           doc                README.rst        test-requirements.txt
    babel.cfg         HACKING.rst        releasenotes      tools
    build             healthcheck        requirements.txt  tox.ini
    ChangeLog         heat_docker_agent  scripts           tripleo_common
    container-images  image-yaml         setup.cfg         undercloud_heat_plugins
    contrib           LICENSE            workbooks
    CONTRIBUTING.rst  playbooks          sudoers           zuul.d

    Again, we need to check the following file:

    setup.cfg: This file defines all the Mistral actions we can call. Specifically, we will need our new actions at the end of this file:

    tripleo.undercloud.get_free_space = tripleo_common.actions.undercloud:GetFreeSpace
    tripleo.undercloud.create_backup_dir = tripleo_common.actions.undercloud:CreateBackupDir
    tripleo.undercloud.create_database_backup = tripleo_common.actions.undercloud:CreateDatabaseBackup
    tripleo.undercloud.create_file_system_backup = tripleo_common.actions.undercloud:CreateFileSystemBackup
    tripleo.undercloud.upload_backup_to_swift = tripleo_common.actions.undercloud:UploadUndercloudBackupToSwift

    4.1. Action definition

    Let's take the first action and describe its definition: tripleo.undercloud.get_free_space = tripleo_common.actions.undercloud:GetFreeSpace

    We have defined an action named tripleo.undercloud.get_free_space, which will instantiate the GetFreeSpace class defined in the file tripleo_common/actions/

    If we open tripleo_common/actions/, we can see the class definition:

    class GetFreeSpace(base.Action):
        """Get the Undercloud free space for the backup.

        The default path to check will be /tmp and the default minimum size
        will be 10240 MB (10GB).

        def __init__(self, min_space=10240):
            self.min_space = min_space

        def run(self, context):
            temp_path = tempfile.gettempdir()
            min_space = self.min_space
            while not os.path.isdir(temp_path):
                head, tail = os.path.split(temp_path)
                temp_path = head
            available_space = (
                (os.statvfs(temp_path).f_frsize * os.statvfs(temp_path).f_bavail) /
                (1024 * 1024))
            if available_space < min_space:
                msg = "There is not enough space, avail. - %s MB" \
                      % str(available_space)
                return actions.Result(error={'msg': msg})
                msg = "There is enough space, avail. - %s MB" \
                      % str(available_space)
                return actions.Result(data={'msg': msg})

    In this specific case, this class will check if there is enough space to perform the backup. Later we will be able to invoke the action as:

    mistral run-action tripleo.undercloud.get_free_space

    or use it in workbooks.

    4.2. Workflow definition.

    Once we have defined all our new actions, we need to orchestrate them into a fully working Mistral workflow.

    All tripleo-common workbooks are defined in the workbooks folder.

    The next example shows a workbook definition; in this case, the first workflow with all the tasks involved:

    version: '2.0'
    name: tripleo.undercloud_backup.v1
    description: TripleO Undercloud backup workflows


          This workflow will prepare the Undercloud to run the database backup
          - tripleo-common-managed
          - queue_name: tripleo

          # Action to know if there is enough available space
          # to run the Undercloud backup
            action: tripleo.undercloud.get_free_space
              status: <% task().result %>
              free_space: <% task().result %>
            on-success: send_message
            on-error: send_message
              status: FAILED
              message: <% task().result %>

          # Sending a message that the folder to create the backup was
          # created successfully
            action: zaqar.queue_post
            retry: count=5 delay=1
              queue_name: <% $.queue_name %>
                  type: tripleo.undercloud_backup.v1.launch
                    status: <% $.status %>
                    execution: <% execution() %>
                    message: <% $.get('message', '') %>
              - fail: <% $.get('status') = "FAILED" %>

    The workflow is mostly self-explanatory; the only part that may not be clear is the last one, as the workflow uses an action to send a message stating whether the workflow ended correctly, passing as the message the output of the previous task (in this case, the result of create_backup_dir).

    5. Give support for new Mistral environment variables when installing the undercloud.

    Sometimes it is necessary to use additional values inside a Mistral task. For example, if we need to create a dump of a database, we might need credentials other than the Mistral user's for authentication purposes.

    When the Undercloud is installed, a Mistral environment called tripleo.undercloud-config is created. This environment holds all the required configuration details that we can later get from Mistral. It is defined in the instack-undercloud repository.

    Let’s get into the repository and check the content of the file instack_undercloud/

    This file defines a set of methods to interact with the Undercloud; specifically, the method called _create_mistral_config_environment allows configuring additional environment variables when installing the Undercloud.
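A Mistral environment is essentially a named bag of variables that workflow tasks can read at run time. As a rough, hypothetical sketch of the idea (the helper name and variable keys below are invented for illustration; the real logic lives in _create_mistral_config_environment):

```python
# Hypothetical sketch: building a Mistral environment in the spirit of
# tripleo.undercloud-config. The helper name and keys are illustrative only.
def build_config_environment(extra_vars):
    environment = {
        'name': 'tripleo.undercloud-config',
        'variables': {},
    }
    # Any additional values a task may need (for example, database
    # credentials for a dump) are merged into the environment variables.
    environment['variables'].update(extra_vars)
    return environment

env = build_config_environment({'undercloud_db_password': 'secret'})
```

A task can then reference such a value with an expression like <% env().undercloud_db_password %> instead of hard-coding it.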

    For additional testing, you can use the Python snippet to call Mistral client from the Undercloud node available in

    6. How to test changes to python-tripleoclient and tripleo-common locally.

    If you need to test a change to python-tripleoclient or tripleo-common locally, the following procedures allow you to do so.

    For a change in python-tripleoclient, assuming you already have downloaded the change you want to test, execute:

    cd python-tripleoclient
    sudo rm -Rf /usr/lib/python2.7/site-packages/tripleoclient*
    sudo rm -Rf /usr/lib/python2.7/site-packages/python_tripleoclient*
    sudo python clean --all install

    For a change in tripleo-common, assuming you already have downloaded the change you want to test, execute:

    cd tripleo-common
    sudo rm -Rf /usr/lib/python2.7/site-packages/tripleo_common*
    sudo python clean --all install
    sudo cp /usr/share/tripleo-common/sudoers /etc/sudoers.d/tripleo-common
    # this loads the actions via entrypoints
    sudo mistral-db-manage --config-file /etc/mistral/mistral.conf populate
    # make sure the new actions got loaded
    mistral action-list | grep tripleo
    for workbook in workbooks/*.yaml; do
        mistral workbook-create $workbook
    done
    for workbook in workbooks/*.yaml; do
        mistral workbook-update $workbook
    done
    sudo systemctl restart openstack-mistral-executor
    sudo systemctl restart openstack-mistral-engine

    If you want to execute a Mistral action or a Mistral workflow manually, use the following commands.

    Examples about how to test Mistral actions independently:

    mistral run-action tripleo.undercloud.get_free_space #Without parameters
    mistral run-action tripleo.undercloud.get_free_space '{"path": "/etc/"}' # With parameters
    mistral run-action tripleo.undercloud.create_file_system_backup '{"sources_path": "/tmp/asdf.txt,/tmp/asdf", "destination_path": "/tmp/"}'

    Examples about how to test a Mistral workflow independently:

    mistral execution-create tripleo.undercloud_backup.v1.prepare_environment # No parameters
    mistral execution-create tripleo.undercloud_backup.v1.filesystem_backup '{"sources_path": "/tmp/asdf.txt,/tmp/asdf", "destination_path": "/tmp/"}' # With parameters

    7. Grant elevated privileges to specific Mistral actions that need them.

    Sometimes it is not possible to execute restricted actions as the Mistral user; for example, when creating the Undercloud backup we won't be able to access the /home/stack/ folder to create a tarball of it. For these cases it's possible to execute elevated actions as the Mistral user:

    This is the content of the sudoers file in the root of the tripleo-common repository at the time of the creation of this guide.

    Defaults!/usr/bin/run-validation !requiretty
    Defaults:validations !requiretty
    Defaults:mistral !requiretty
    mistral ALL = (validations) NOPASSWD:SETENV: /usr/bin/run-validation
    mistral ALL = NOPASSWD: /usr/bin/chown -h validations\: /tmp/validations_identity_[A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_], \
            /usr/bin/chown validations\: /tmp/validations_identity_[A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_], \
            !/usr/bin/chown /tmp/validations_identity_* *, !/usr/bin/chown /tmp/validations_identity_*..*
    mistral ALL = NOPASSWD: /usr/bin/rm -f /tmp/validations_identity_[A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_][A-Za-z0-9_], \
            !/usr/bin/rm /tmp/validations_identity_* *, !/usr/bin/rm /tmp/validations_identity_*..*
    mistral ALL = NOPASSWD: /bin/nova-manage cell_v2 discover_hosts *
    mistral ALL = NOPASSWD: /usr/bin/tar --ignore-failed-read -C / -cf /tmp/undercloud-backup-*.tar *
    mistral ALL = NOPASSWD: /usr/bin/chown mistral. /tmp/undercloud-backup-*/filesystem-*.tar
    validations ALL = NOPASSWD: ALL

    Here you can grant permissions for specific tasks when executing Mistral workflows from tripleo-common.
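To see how an action must line up with these rules, here is a hedged sketch of building the exact tar command whitelisted above (the helper name and backup id are invented; a command that does not match the pattern character for character is rejected by sudo):

```python
def undercloud_backup_command(backup_id, sources):
    # Must match the whitelisted sudoers rule:
    #   /usr/bin/tar --ignore-failed-read -C / -cf /tmp/undercloud-backup-*.tar *
    # A different flag order or target path would not match the rule, sudo
    # would prompt for a password, and the action would fail.
    return (['sudo', '/usr/bin/tar', '--ignore-failed-read', '-C', '/',
             '-cf', '/tmp/undercloud-backup-%s.tar' % backup_id]
            + list(sources))

cmd = undercloud_backup_command('demo', ['home/stack'])
```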

    7. Debugging actions.

    Let’s assume the action is written and added to setup.cfg, but it does not appear. First, check whether the action was added by sudo mistral-db-manage populate. Run:

    mistral action-list -f value -c Name | grep -e '^tripleo.undercloud'

    If you don’t see your actions, check the output of sudo mistral-db-manage populate:

    sudo mistral-db-manage populate 2>&1| grep ERROR | less

    Output like the following indicates an issue in the code itself; fix the code and run populate again.

    2018-01-01:00:59.730 7218 ERROR stevedore.extension [-] Could not load 'tripleo.undercloud.get_free_space': unexpected indent (, line 40):   File "/usr/lib/python2.7/site-packages/tripleo_common/actions/", line 40

    Execute the single action, then execute the workflow from the workbook, to make sure it works as designed.

    8. Unit tests

    Writing unit tests is an essential part of software development, and unit tests are much faster than running the workflow itself. So, let’s write unit tests for the new action. Add a file under tripleo_common/tests/actions/ in the tripleo-common repository with the following content.

    import mock

    from tripleo_common.actions import undercloud
    from tripleo_common.tests import base


    class GetFreeSpaceTest(base.TestCase):

        def setUp(self):
            super(GetFreeSpaceTest, self).setUp()
            self.temp_dir = "/tmp"

        # The three patch decorators are reconstructed from the test
        # signatures; decorators apply bottom-up, so os.statvfs maps to
        # mock_statvfs, the first mock argument.
        @mock.patch('tempfile.gettempdir')
        @mock.patch('os.path.isdir')
        @mock.patch('os.statvfs')
        def test_run_false(self, mock_statvfs, mock_isdir, mock_gettempdir):
            mock_gettempdir.return_value = self.temp_dir
            mock_isdir.return_value = True
            mock_statvfs.return_value = mock.MagicMock(
                spec_set=['f_frsize', 'f_bavail'],
                f_frsize=4096, f_bavail=1024)
            action = undercloud.GetFreeSpace()
            action_result ={})
            # Result attribute assumed from mistral-lib's Result(error=...)
            self.assertEqual("There is no enough space, avail. - 4 MB",
                             action_result.error)

        @mock.patch('tempfile.gettempdir')
        @mock.patch('os.path.isdir')
        @mock.patch('os.statvfs')
        def test_run_true(self, mock_statvfs, mock_isdir, mock_gettempdir):
            mock_gettempdir.return_value = self.temp_dir
            mock_isdir.return_value = True
            mock_statvfs.return_value = mock.MagicMock(
                spec_set=['f_frsize', 'f_bavail'],
                f_frsize=4096, f_bavail=10240000)
            action = undercloud.GetFreeSpace()
            action_result ={})
            # Result attribute assumed from mistral-lib's Result(data=...)
            self.assertEqual("There is enough space, avail. - 40000 MB",


    tox -epy27

    to see any unit test errors.
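For reference, the action exercised by these tests can be approximated with the following self-contained sketch. The helper name and the 1024 MB threshold are assumptions; the real GetFreeSpace action in tripleo-common may differ, but the statvfs arithmetic matches what the tests assert (4096 bytes per block times 1024 free blocks is 4 MB):

```python
import os
import tempfile

def free_space_message(path=None, min_space_mb=1024):
    """Approximate the GetFreeSpace check: report available space under
    the temp dir. The min_space_mb threshold is an assumed value."""
    path = path or tempfile.gettempdir()
    if not os.path.isdir(path):
        return "Path %s is not a directory" % path
    stats = os.statvfs(path)
    # Available bytes = fragment size * free blocks, converted to MB.
    available_mb = (stats.f_frsize * stats.f_bavail) // (1024 * 1024)
    if available_mb < min_space_mb:
        return "There is no enough space, avail. - %s MB" % available_mb
    return "There is enough space, avail. - %s MB" % available_mb
```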

    • Undercloud backups are an important step before running an upgrade.
    • Writing developer docs will help people create and develop new features.

    9. References


    by Carlos Camacho at December 18, 2017 12:00 AM

    December 15, 2017

    RDO Blog

    Blog Round-up

    It's time for another round-up of the great content that's circulating our community. But before we jump in, if you know of an OpenStack or RDO-focused blog that isn't featured here, be sure to leave a comment below and we'll add it to the list.

    ICYMI, here's what has sparked the community's attention this month, from Ansible to TripleO, emoji-rendering, and more.

    TripleO and Ansible (Part 2) by slagle

    In my last post, I covered some of the details about using Ansible to deploy with TripleO. If you haven’t read that yet, I suggest starting there:


    TripleO and Ansible deployment (Part 1) by slagle

    In the Queens release of TripleO, you’ll be able to use Ansible to apply the software deployment and configuration of an Overcloud.


    An Introduction to Fernet tokens in Red Hat OpenStack Platform by Ken Savich, Senior OpenStack Solution Architect

    Thank you for joining me to talk about Fernet tokens. In this first of three posts on Fernet tokens, I’d like to go over the definition of OpenStack tokens, the different types and why Fernet tokens should matter to you. This series will conclude with some awesome examples of how to use Red Hat Ansible to manage your Fernet token keys in production.


    Full coverage of libvirt XML schemas achieved in libvirt-go-xml by Daniel Berrange

    In recent times I have been aggressively working to expand the coverage of libvirt XML schemas in the libvirt-go-xml project. Today this work has finally come to a conclusion, when I achieved what I believe to be effectively 100% coverage of all of the libvirt XML schemas. More on this later, but first some background on Go and XML…


    Full colour emojis in virtual machine names in Fedora 27 by Daniel Berrange

    Quite by chance today I discovered that Fedora 27 can display full colour glyphs for unicode characters that correspond to emojis, when the terminal displaying my mutt mail reader displayed someone’s name with a full colour glyph showing stars:


    Booting baremetal from a Cinder Volume in TripleO by higginsd

    Up until recently in TripleO, booting from a cinder volume was confined to virtual instances, but now, thanks to some recent work in ironic, baremetal instances can also be booted backed by a cinder volume.


    by Mary Thengvall at December 15, 2017 06:20 AM

    December 14, 2017

    Red Hat Stack

    Red Hat OpenStack Platform 12 Is Here!

    We are happy to announce that Red Hat OpenStack Platform 12 is now Generally Available (GA).

    This is Red Hat OpenStack Platform’s 10th release and is based on the upstream OpenStack release, Pike.

    Red Hat OpenStack Platform 12 is focused on the operational aspects of deploying OpenStack. OpenStack has established itself as a solid technology choice, and with this release we are working hard to further improve the usability aspects and bring OpenStack and operators into harmony.


    With operationalization in mind, let’s take a quick look at some of the biggest and most exciting features now available.


    As containers are changing and improving IT operations it only stands to reason that OpenStack operators can also benefit from this important and useful technology concept. In Red Hat OpenStack Platform we have begun the work of containerizing the control plane. This includes some of the main services that run OpenStack, like Nova and Glance, as well as supporting technologies, such as Red Hat Ceph Storage. All these services can be deployed as containerized applications via Red Hat OpenStack Platform’s lifecycle and deployment tool, director.

    Photo by frank mckenna on Unsplash

    Bringing a containerized control plane to OpenStack is important. Through it we can immediately enhance, among other things, stability and security features through isolation. By design, OpenStack services often have complex, overlapping library dependencies that must be accounted for in every upgrade, rollback, and change. For example, if Glance needs a security patch that affects a library shared by Nova, time must be spent to ensure Nova can survive the change; or even more frustratingly, Nova may need to be updated itself. This makes the change effort and resulting change window and impact, much more challenging. Simply put, it’s an operational headache.

    However, when we isolate those dependencies into a container we are able to work with services with much more granularity and separation. An urgent upgrade to Glance can be done alongside Nova without affecting it in any way. With this granularity, operators can more easily quantify and test the changes helping to get them to production more quickly.

    We are working closely with our vendors, partners, and customers to move to this containerized approach in a way that is minimally disruptive. Upgrading from a non-containerized control plane to one with most services containerized is fully managed by Red Hat OpenStack Platform director. Indeed, when upgrading from Red Hat OpenStack Platform 11 to Red Hat OpenStack Platform 12 the entire move to containerized services is handled “under the hood” by director. With just a few simple preparatory steps director delivers the biggest change to OpenStack in years direct to your running deployment in an almost invisible, simple to run, upgrade. It’s really cool!

    Red Hat Ansible.

    Like containers, it’s pretty much impossible to work in operations and not be aware of, or more likely be actively using, Red Hat Ansible. Red Hat Ansible is known to be easier to use for customising and debugging; most operators are more comfortable with it, and it generally provides an overall nicer experience through a straightforward and easy to read format.


    Of course, we at Red Hat are excited to include Ansible as a member of our own family. With Red Hat Ansible we are actively integrating this important technology into more and more of our products.

    In Red Hat OpenStack Platform 12, Red Hat Ansible takes center stage.

    But first, let’s be clear, we have not dropped Heat; there are very real requirements around backward compatibility and operator familiarity that are delivered with the Heat template model.

    But we don’t have to compromise because of this requirement. With Ansible we are offering operator and developer access points independent of the Heat templates. We use the same composable services architecture as we had before; the Heat-level flexibility still works the same, we just translate to Ansible under the hood.

    Simplistically speaking, before Ansible, our deployments were mostly managed by Heat templates driving Puppet. Now, we use Heat to drive Ansible by default, and then Ansible drives Puppet and other deployment activities as needed. And with the addition of containerized services, we also have positioned Ansible as a key component of the entire container deployment. By adding a thin layer of Ansible, operators can now interact with a deployment in ways they could not previously.

    For instance, take the new openstack overcloud config download command. This command allows an operator to generate all the Ansible playbooks being used for a deployment into a local directory for review. And these aren’t mere interpretations of Heat actions, these are the actual, dynamically generated playbooks being run during the deployment. Combine this with Ansible’s cool dynamic inventory feature, which allows an operator to maintain their Ansible inventory file based on a real-time infrastructure query, and you get an incredibly powerful troubleshooting entry point.

    Check out this short (1:50) video showing Red Hat Ansible and this new exciting command and concept:

    Network composability.

    Another major new addition for operators is the extension of the composability concept into networks.

    As a reminder, when we speak about composability we are talking about enabling operators to create detailed solutions by giving them basic, simple, defined components from which they can build for their own unique, complex topologies.

    With composable networks, operators are no longer only limited to using the predefined networks provided by director. Instead, they can now create additional networks to suit their specific needs. For instance, they might create a network just for NFS filer traffic, or a dedicated SSH network for security reasons.

    Photo by Radek Grzybowski on Unsplash

    And as expected, composable networks work with composable roles. Operators can create custom roles and apply multiple, custom networks to them as required. The combinations lead to an incredibly powerful way to build complex enterprise network topologies, including an on-ramp to the popular L3 spine-leaf topology.

    And to make it even easier to put together we have added automation in director that verifies that resources and Heat templates for each composable network are automatically generated for all roles. Fewer templates to edit can mean less time to deployment!

    Telco speed.

    Telcos will be excited to know we are now delivering production ready virtualized fast data path technologies. This release includes Open vSwitch 2.7 and the Data Plane Development Kit (DPDK) 16.11 along with improvements to Neutron and Nova allowing for robust virtualized deployments that include support for large MTU sizing (i.e. jumbo frames) and multiple queues per interface. OVS+DPDK is now a viable option alongside SR-IOV and PCI passthrough in offering more choice for fast data in Infrastructure-as-a-Service (IaaS) solutions.

    Operators will be pleased to see that these new features can be more easily deployed thanks to new capabilities within Ironic, which store environmental parameters during introspection. These values are then available to the overcloud deployment providing an accurate view of hardware for ideal tuning. Indeed, operators can further reduce the complexity around tuning NFV deployments by allowing director to use the collected values to dynamically derive the correct parameters resulting in truly dynamic, optimized tuning.

    Serious about security.


    Helping operators, and the companies they work for, focus on delivering business value instead of worrying about their infrastructure is core to Red Hat’s thinking. And one way we make sure everyone sleeps better at night with OpenStack is through a dedicated focus on security.

    Starting with Red Hat OpenStack Platform 12 we have more internal services using encryption than in any previous release. This is an important step for OpenStack as a community to help increase adoption in enterprise datacenters, and we are proud to be squarely at the center of that effort. For instance, in this release even more services now feature internal TLS encryption.

    Let’s be realistic, though, focusing on security extends beyond just technical implementation. Starting with Red Hat OpenStack Platform 12 we are also releasing a comprehensive security guide, which provides best practices as well as conceptual information on how to make an OpenStack cloud more secure. Our security stance is firmly rooted in meeting global standards from top international agencies such as FedRAMP (USA), ETSI (Europe), and ANSSI (France). With this guide, we are excited to share these efforts with the broader community.

    Do you even test?

    How many times has someone asked an operations person this question? Too many! “Of course we test,” they will say. And with Red Hat OpenStack Platform 12 we’ve decided to make sure the world knows we do, too.

    Through the concept of Distributed Continuous Integration (DCI), we place remote agents on site with customers, partners, and vendors that continuously build our releases at all different stages on all different architectures. By engaging outside resources we are not limited by internal resource restrictions; instead, we gain access to hardware and architecture that could never be tested in any one company’s QA department. With DCI we can fully test our releases to see how they work under an ever-increasing set of environments. We are currently partnered with major industry vendors for this program and are very excited about how it helps us make the entire OpenStack ecosystem better for our customers.

    So, do we even test? Oh, you bet we do!

    Feel the love!

    Photo by grafxart photo on Unsplash

    And this is just a small piece of the latest Red Hat OpenStack Platform 12 release. Whether you are looking to try out a new cloud, or thinking about an upgrade, this release brings a level of operational maturity that will really impress!

    Now that OpenStack has proven itself an excellent choice for IaaS, it can focus on making itself a loveable one.

    Let Red Hat OpenStack Platform 12 reignite the romance between you and your cloud!

    Red Hat OpenStack Platform 12 is designated as a “Standard” release with a one-year support window. Click here for more details on the release lifecycle for Red Hat OpenStack Platform.

    Find out more about this release at the Red Hat OpenStack Platform Product page. Or visit our vast online documentation.

    And if you’re ready to get started now, check out the free 60-day evaluation available on the Red Hat portal.

    Looking for even more? Contact your local Red Hat office today.


    by August Simonelli, Technical Marketing Manager, Cloud at December 14, 2017 01:49 AM

    December 13, 2017

    James Slagle

    TripleO and Ansible (Part 2)

    In my last post, I covered some of the details about using Ansible to deploy
    with TripleO. If you haven’t read that yet, I suggest starting there:

    I’ll now cover interacting with Ansible more directly.

    When using --config-download as a deployment argument, a Mistral workflow will be enabled that runs ansible-playbook to apply the deployment and configuration data to each node. When the deployment is complete, you can interact with the files that were created by the workflow.

    Let’s take a look at how to do that.

    You need to have a shell on the Undercloud. Since the files used by the workflow potentially contain sensitive data, they are only readable by the mistral user or group. So either become the root user, or add your interactive shell user account (typically “stack”) to the mistral group:

    sudo usermod -a -G mistral stack
    # Activate the new group
    newgrp mistral

    Once the permissions are sorted, change to the mistral working directory for
    the config-download workflows:

    cd /var/lib/mistral

    Within that directory, there will be directories named according to the Mistral
    execution uuid. An easy way to find the most recent execution of
    config-download is to just cd into the most recently created directory and list
    the files in that directory:

    cd 2747b55e-a7b7-4036-82f7-62f09c63d671

    The following files (or a similar set, as things could change) will exist:


    All the files that are needed to re-run ansible-playbook are present. The exact ansible-playbook command is saved in Let’s take a look at that file:

    $ cat
    ansible-playbook -v /var/lib/mistral/2747b55e-a7b7-4036-82f7-62f09c63d671/deploy_steps_playbook.yaml --user tripleo-admin --become --ssh-extra-args "-o StrictHostKeyChecking=no" --timeout 240 --inventory-file /var/lib/mistral/2747b55e-a7b7-4036-82f7-62f09c63d671/tripleo-ansible-inventory --private-key /var/lib/mistral/2747b55e-a7b7-4036-82f7-62f09c63d671/ssh_private_key $@

    You can see how the call to ansible-playbook is reproduced in this script. Also notice that $@ is used to pass any additional arguments directly to ansible-playbook when calling this script, such as --check, --limit, --tags, --start-at-task, etc.
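In effect, re-running the deployment boils down to rebuilding that ansible-playbook command line with your extra arguments appended where "$@" sits. A hypothetical Python sketch of the same idea (paths are taken from the transcript above; the helper name is invented):

```python
def rerun_playbook_cmd(workdir, extra_args=()):
    # Mirrors the saved wrapper script; extra_args plays the role of "$@".
    return ['ansible-playbook', '-v',
            '%s/deploy_steps_playbook.yaml' % workdir,
            '--user', 'tripleo-admin', '--become',
            '--ssh-extra-args', '-o StrictHostKeyChecking=no',
            '--timeout', '240',
            '--inventory-file', '%s/tripleo-ansible-inventory' % workdir,
            '--private-key', '%s/ssh_private_key' % workdir,
            ] + list(extra_args)

cmd = rerun_playbook_cmd('/var/lib/mistral/2747b55e-a7b7-4036-82f7-62f09c63d671',
                         ['--check', '--limit', 'Controller'])
```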

    Some of the other files present are:

    • tripleo-ansible-inventory
      • Ansible inventory file containing hosts and vars for all the Overcloud nodes.
    • ansible.log
      • Log file from the last run of ansible-playbook.
    • ansible.cfg
      • Config file used when running ansible-playbook.
      • Executable script that can be used to rerun ansible-playbook.
    • ssh_private_key
      • Private ssh key used to ssh to the Overcloud nodes.

    Within the group_vars directory, there is a corresponding file per role. In my
    example, I have a Controller role. If we take a look at group_vars/Controller we see it contains:

    $ cat group_vars/Controller
    Controller_pre_deployments:
    - HostsEntryDeployment
    - DeployedServerBootstrapDeployment
    - UpgradeInitDeployment
    - InstanceIdDeployment
    - NetworkDeployment
    - ControllerUpgradeInitDeployment
    - UpdateDeployment
    - ControllerDeployment
    - SshHostPubKeyDeployment
    - ControllerSshKnownHostsDeployment
    - ControllerHostsDeployment
    - ControllerAllNodesDeployment
    - ControllerAllNodesValidationDeployment
    - ControllerArtifactsDeploy
    - ControllerHostPrepDeployment
    Controller_post_deployments: []

    The <RoleName>_pre_deployments and <RoleName>_post_deployments variables contain the list of Heat deployment names to run for that role. Suppose we wanted to just rerun a single deployment. That command would be:

    $ ./ --tags pre_deploy_steps -e Controller_pre_deployments=ControllerArtifactsDeploy -e force=true

    That would run just the ControllerArtifactsDeploy deployment. Passing -e force=true is necessary to force the deployment to rerun. Also notice we restrict what tags get run with --tags pre_deploy_steps.

    For documentation on what tags are available see:

    Finally, suppose we wanted to just run the 5 deployment steps that are the same for all nodes of a given role. We can use --limit <RoleName>, as the role names are defined as groups in the inventory file. That command would be:

    $ ./ --tags deploy_steps --limit Controller

    I hope this info is helpful. Let me know what you want to see next.




    Cross posted at:


    by slagle at December 13, 2017 01:12 PM

    December 11, 2017

    Red Hat Stack

    Enabling Keystone’s Fernet Tokens in Red Hat OpenStack Platform

    As we learned in part one of this blog post, beginning with the OpenStack Kilo release, a new token provider is now available as an alternative to PKI and UUID. Fernet tokens are essentially an implementation of ephemeral tokens in Keystone. What this means is that tokens are no longer persisted and hence do not need to be replicated across clusters or regions.

    “In short, OpenStack’s authentication and authorization metadata is neatly bundled into a MessagePacked payload, which is then encrypted and signed as a Fernet token. OpenStack Kilo’s implementation supports a three-phase key rotation model that requires zero downtime in a clustered environment.” (from:

    In our previous post, I covered the different types of tokens, the benefits of Fernet, and a little bit of the technical details. In this part of our three-part series we provide a method for enabling Fernet tokens on Red Hat OpenStack Platform 10, during both pre and post deployment of the overcloud stack.

    Pre-Overcloud Deployment

    Official Red Hat documentation for enabling Fernet tokens in the overcloud can be found here:

    Deploy Fernet on the Overcloud


    We’ll be using the Red Hat OpenStack Platform here, so this means we’ll be interacting with the director node and heat templates. Our primary tool is the command-line client keystone-manage, part of the tools provided by the openstack-keystone RPM and used to set up and manage keystone in the overcloud. Of course, we’ll be using the director-based deployment of Red Hat’s OpenStack Platform to enable Fernet pre and/or post deployment.

    Photo by Barn Images on Unsplash

    Prepare Fernet keys on the undercloud

    This procedure starts with the preparation of the Fernet keys, which a default deployment places on each controller in /etc/keystone/fernet-keys. Each controller must have the same keys, as tokens issued on one controller must be able to be validated on all controllers. Stay tuned to part three of this blog for an in-depth explanation of Fernet signing keys.

    1. Source the stackrc file to ensure we are working with the undercloud:
    $ source ~/stackrc
    2. From your director, use keystone-manage to generate the Fernet keys as deployment artifacts:
    $ sudo keystone-manage fernet_setup \
        --keystone-user keystone \
        --keystone-group keystone
    3. Tar up the keys for upload into a swift container on the undercloud:
    $ sudo tar -zcf keystone-fernet-keys.tar.gz /etc/keystone/fernet-keys
    4. Upload the Fernet keys to the undercloud as swift artifacts (we assume your templates exist in ~/templates):
    $ upload-swift-artifacts -f keystone-fernet-keys.tar.gz \
        --environment ~/templates/deployment-artifacts.yaml
    5. Verify that your artifact exists in the undercloud:
    $ swift list overcloud-artifacts
    keystone-fernet-keys.tar.gz

    NOTE: These keys should be secured as they can be used to sign and validate tokens that will have access to your cloud.

    6. Let’s verify that deployment-artifacts.yaml exists in ~/templates (NOTE: your URL detail will differ from what you see here, as this is a uniquely generated temporary URL):
    $ cat ~/templates/deployment-artifacts.yaml
    # Heat environment to deploy artifacts via Swift Temp URL(s)
        - '

    NOTE: This is the swift URL that your overcloud deployment will use to copy the Fernet keys to your controllers.

    7. Finally, generate the fernet.yaml template to enable Fernet as the default token provider in your overcloud:
    $ cat << EOF > ~/templates/fernet.yaml
                keystone::token_provider: 'fernet'

    Deploy and Validate

    At this point, you are ready to deploy your overcloud with Fernet enabled as the token provider, and your keys distributed to each controller in /etc/keystone/fernet-keys.

    Photo by Glenn Carstens-Peters on Unsplash

    NOTE: This is an example deploy command; yours will likely include many more templates. For the purposes of our discussion, it is important simply that you include fernet.yaml as well as deployment-artifacts.yaml.

    $ openstack overcloud deploy \
    --templates /home/stack/templates \
    -e  /home/stack/templates/environments/deployment-artifacts.yaml \
    -e /home/stack/templates/environments/fernet.yaml \
    --control-scale 3 \
    --compute-scale 4 \
    --control-flavor control \
    --compute-flavor compute \


    Once the deployment is done you should validate that your overcloud is indeed using Fernet tokens instead of the default UUID token provider. From the director node:

    $ source ~/overcloudrc
    $ openstack token issue
    | Field      | Value                                    |
    | expires    | 2017-03-22 19:16:21+00:00                |
    | id | gAAAAABY0r91iYvMFQtGiRRqgMvetAF5spEZPTvEzCpFWr3  |
    |    | 1IB8T8L1MRgf4NlOB6JsfFhhdxenSFob_0vEEHLTT6rs3Rw  |
    |    | q3-Zm8stCF7sTIlmBVms9CUlwANZOQ4lRMSQ6nTfEPM57kX  |
    |    | Xw8GBGouWDz8hqDYAeYQCIHtHDWH5BbVs_yC8ICXBk       |
    | project_id | f8adc9dea5884d23a30ccbd486fcf4c6         |
    | user_id    | 2f6106cef80741c6ae2bfb3f25d70eee         |

    Note the length of this token in the “id” field. This is a Fernet token.

    Enabling Fernet Post Overcloud Deployment

    Part of the power of the Red Hat OpenStack Platform director deployment methodology lies in its ability to easily upgrade and change a running overcloud. Features such as Fernet, scaling, and complex service management, can be managed by running a deployment update directly against a running overcloud.

    Updating is really straightforward. If you’ve already deployed your overcloud with UUID tokens, you can change them to Fernet by simply following the pre-deploy example above and running the openstack overcloud deploy command again, with the heat templates mentioned above enabled, against your running deployment. This will change your overcloud token default to Fernet. Be sure to deploy with your original deploy command, as any changes there could affect your overcloud. And of course, standard outage windows apply: production changes should be tested and prepared accordingly.


    I hope you’ve enjoyed our discussion on enabling Fernet tokens in the overcloud, and that I was able to shed some light on the process. Official documentation on these concepts and on Fernet tokens in the overcloud is available

    In our final installment on this topic, we’ll look at some of the many methods for rotating your newly enabled Fernet keys on your controller nodes. We’ll be using Red Hat’s awesome IT automation tool, Red Hat Ansible, to do just that.

    by Ken Savich, Senior OpenStack Solution Architect at December 11, 2017 08:59 PM

    James Slagle

    TripleO and Ansible deployment (Part 1)

    In the Queens release of TripleO, you’ll be able to use Ansible to apply the
    software deployment and configuration of an Overcloud.

    Before jumping into some of the technical details, I wanted to cover some
    background about how the Ansible integration works alongside some of the
    existing tools in TripleO.

    The Ansible integration goes as far as offering an alternative to the
    communication between the existing Heat agent (os-collect-config) and the Heat
    API. This alternative is opt-in for Queens, but we are exploring making it the
    default behavior for future releases.

    The default behavior for Queens (and all prior releases) will still use the
    model where each Overcloud node has a daemon agent called os-collect-config
    that periodically polls the Heat API for deployment and configuration data.
    When Heat provides updated data, the agent applies the deployments, making
    changes to the local node such as configuration, service management,
    pulling/starting containers, etc.

    The Ansible alternative instead uses a “control” node (the Undercloud) running
    ansible-playbook with a local inventory file and pushes out all the changes to
    each Overcloud node via ssh in the typical Ansible fashion.
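    Conceptually, the push model boils down to a plain ansible-playbook run from the undercloud; a sketch only, with illustrative file names (the real inventory and playbooks are generated for you by the deployment workflow):

```shell
# Run from the undercloud; inventory and playbook names are illustrative
ansible-playbook -i inventory.yaml deploy_steps_playbook.yaml
```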

    Heat is still the primary API, while the parameter and environment files that
    get passed to Heat to create an Overcloud stack remain the same regardless of
    which method is used.

    Heat is also still fully responsible for creating and orchestrating all
    OpenStack resources in the services running on the Undercloud (Nova servers,
    Neutron networks, etc).

    This sequence diagram will hopefully provide a clear picture:

    Replacing the application and transport layer of the deployment with Ansible
    allows us to take advantage of features in Ansible that will hopefully make
    deploying and troubleshooting TripleO easier:

    • Running only specific deployments
    • Including/excluding specific nodes or roles from an update
    • More real time sequential output of the deployment
    • More robust error reporting
    • Faster iteration and reproduction of deployments

    Using Ansible instead of the Heat agent is easy. Just include 2 extra cli args
    in the deployment command:

    -e /path/to/templates/environments/config-download-environment.yaml \
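    A complete invocation might look like the following sketch; the --config-download flag and the paths shown are assumptions based on Queens-era TripleO documentation, so check the release documentation for the exact pair of arguments:

```shell
openstack overcloud deploy \
  --templates \
  -e /path/to/templates/environments/config-download-environment.yaml \
  --config-download
```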

    Once Heat is done creating the stack (which will be much faster than usual), a
    separate Mistral workflow will be triggered that runs ansible-playbook to
    finish the deployment. The output from ansible-playbook will be streamed to
    stdout so you can follow along with the progress.

    Here’s a demo showing what a stack update looks like:

    (I suggest viewing the demo full screen, or watching it here:

    Note that we don’t get color output from ansible-playbook since we are
    consuming the stdout from a Zaqar queue. However, in my next post I will go
    into how to execute ansible-playbook manually, and detail all of the related
    files (inventory, playbooks, etc) that are available to interact with manually.

    If you want to read ahead, have a look at the official documentation:


    The infrastructure that hosts this blog may go away soon. In which case I’m
    also cross posting to:


    by slagle at December 11, 2017 03:14 PM

    December 08, 2017

    RDO Blog

    Gate repositories on Github with Software Factory and Zuul3


    Software Factory is an easy to deploy software development forge. It provides, among other features, code review and continuous integration (CI). The latest Software Factory release features Zuul V3, which provides integration with Github.

    In this blog post I will explain how to configure a Software Factory instance, so that you can experiment with gating Github repositories with Zuul.

    First we will setup a Github application to define the Software Factory instance as a third party application and we will configure this instance to act as a CI system for Github.

    Secondly, we will prepare a Github test repository by:

    • Installing the application on it
    • Configuring its master branch protection policy
    • Providing Zuul job description files

    Finally, we will configure the Software Factory instance to test and gate Pull Requests for this repository, and we will validate this CI by opening a first Pull Request on the test repository.

    Note that Zuul V3 is not yet released upstream; however, it is already in production, acting as the CI system of OpenStack.


    A Software Factory instance is required to execute the instructions given in this blog post. If you need an instance, you can follow the quick deployment guide in this previous article. Make sure the instance has a public IP address and TCP/443 is open so that Github can reach Software Factory via HTTPS.

    Application creation and Software Factory configuration

    Let's create a Github application named myorg-zuulapp and register it on the instance. To do so, follow this section from Software Factory's documentation.

    But make sure to:

    • Replace fqdn in the instructions with the public IP address of your Software Factory instance, as the default hostname won't be resolved by Github.
    • Check "Disable SSL verification" as the Software Factory instance is by default configured with a self-signed certificate.
    • Check "Only on this account" for the question "Where can this Github app be installed".

    Configuration of the app part 1 Configuration of the app part 2 Configuration of the app part 3

    After adding the github app settings in /etc/software-factory/sfconfig.yaml, run:

    sudo sfconfig --enable-insecure-slaves --disable-fqdn-redirection

    Finally, make sure Github can contact the Software Factory instance by clicking on "Redeliver" in the advanced tab of the application. The green tick is the prerequisite for going further; if you cannot get it, you will not be able to complete the rest of this article successfully.

    Configuration of the app part 4

    Define Zuul3 specific Github pipelines

    On the Software Factory instance, as root, create the file config/zuul.d/gh_pipelines.yaml.

    cd /root/config
    cat <<EOF > zuul.d/gh_pipelines.yaml
    # NB: the Github connection name ("github.com") below is assumed; it must
    # match the connection configured via sfconfig
    - pipeline:
        name: check
        description: |
          Newly uploaded patchsets enter this pipeline to receive an
          initial +/-1 Verified vote.
        manager: independent
        trigger:
          github.com:
            - event: pull_request
              action:
                - opened
                - changed
                - reopened
            - event: pull_request
              action: comment
              comment: (?i)^\s*recheck\s*$
        start:
          github.com:
            status: 'pending'
            status-url: "{}/status.html"
            comment: false
        success:
          github.com:
            status: 'success'
        failure:
          github.com:
            status: 'failure'

    - pipeline:
        name: gate
        description: |
          Changes that have been approved by core developers are enqueued
          in order in this pipeline, and if they pass tests, will be
          merged.
        success-message: Build succeeded (gate pipeline).
        failure-message: Build failed (gate pipeline).
        manager: dependent
        precedence: high
        require:
          github.com:
            review:
              - permission: write
            status: "myorg-zuulapp[bot]:local/"
            open: True
            current-patchset: True
        trigger:
          github.com:
            - event: pull_request_review
              action: submitted
              state: approved
            - event: pull_request
              action: status
              status: "myorg-zuulapp[bot]:local/"
        start:
          github.com:
            status: 'pending'
            status-url: "{}/status.html"
            comment: false
        success:
          github.com:
            status: 'success'
            merge: true
        failure:
          github.com:
            status: 'failure'
    EOF
    sed -i s/myorg/myorgname/ zuul.d/gh_pipelines.yaml

    Make sure to replace "myorgname" by the organization name.

    git add -A .
    git commit -m"Add pipelines"
    git push git+ssh://gerrit/config master

    Setup a test repository on Github

    Create a repository called ztestrepo, initialize it with an empty

    Install the Github application

    Then follow the process below to add the application myorg-zuulapp to ztestrepo.

    1. Visit your application page, e.g.:
    2. Click “Install”
    3. Select ztestrepo to install the application on
    4. Click “Install”

    You should then be redirected to the application setup page. This can be safely ignored for the moment.

    Define master branch protection

    We will set up the branch protection policy for the master branch of ztestrepo. We want a Pull Request to have at least one code review approval and all CI checks passing before it becomes mergeable.

    You will see, later in this article, that the final job run and the merging phase of the Pull Request are ensured by Zuul.

    1. Go to
    2. Choose the master branch
    3. Check "Protect this branch"
    4. Check "Require pull request reviews before merging"
    5. Check "Dismiss stale pull request approvals when new commits are pushed"
    6. Check "Require status checks to pass before merging"
    7. Click "Save changes"

    Attach the application

    Add a collaborator

    A second account on Github is needed to act as a collaborator on the ztestrepo repository. This collaborator will act as the PR reviewer later in this article.

    Define a Zuul job

    Create the file .zuul.yaml at the root of ztestrepo.

    git clone
    cd ztestrepo
    cat <<EOF > .zuul.yaml
    - job:
        name: myjob-noop
        parent: base
        description: This is a noop job
        run: playbooks/noop.yaml
        nodeset:
          nodes:
            - name: test-node
              label: centos-oci

    - project:
        name: myorg/ztestrepo
        check:
          jobs:
            - myjob-noop
        gate:
          jobs:
            - myjob-noop
    EOF
    sed -i s/myorg/myorgname/ .zuul.yaml

    Make sure to replace "myorgname" by the organization name.

    Create playbooks/noop.yaml.

    mkdir playbooks
    cat <<EOF > playbooks/noop.yaml
    - hosts: test-node
      tasks:
        - name: Success
          command: "true"
    EOF

    Push the changes directly to the master branch of ztestrepo.

    git add -A .
    git commit -m"Add zuulv3 job definition"
    git push origin master

    Register the repository on Zuul

    At this point, the Software Factory instance is ready to receive events from Github and the Github repository is properly configured. Now we will tell Software Factory to consider events for the repository.

    On the Software Factory instance, as root, create the file myorg.yaml.

    cd /root/config
    cat <<EOF > zuulV3/myorg.yaml
    - tenant:
        name: 'local'
        source:
          github.com:
            untrusted-projects:
              - myorg/ztestrepo
    EOF
    sed -i s/myorg/myorgname/ zuulV3/myorg.yaml

    Make sure to replace "myorgname" by the organization name.

    git add zuulV3/myorg.yaml && git commit -m"Add ztestrepo to zuul" && git push git+ssh://gerrit/config master

    Create a Pull Request and see Zuul in action

    1. Create a Pull Request via the Github UI
    2. Wait for the check pipeline to finish with success

    Check test

    1. Ask the collaborator to approve the Pull Request


    1. Wait for Zuul to detect the approval
    2. Wait for the gate pipeline to finish with success

    Gate test

    1. Wait for the Pull Request to be merged by Zuul


    As you can see, after the run of the check job and the reviewer's approval, Zuul detected that the Pull Request was ready to enter the gating pipeline. During the gate run, Zuul executed the job against the Pull Request's code change rebased on the current master, then had Github merge the Pull Request once the job ended successfully.

    Other powerful Zuul features such as cross-repository testing or Pull Request dependencies between repositories are supported but beyond the scope of this article. Do not hesitate to refer to the upstream documentation to learn more about Zuul.

    Next steps to go further

    To learn more about Software Factory please refer to the upstream documentation. You can reach the Software Factory team on IRC freenode channel #softwarefactory or by email at the mailing list.

    by fboucher at December 08, 2017 02:13 PM

    December 07, 2017

    Red Hat Stack

    An Introduction to Fernet tokens in Red Hat OpenStack Platform

    Thank you for joining me to talk about Fernet tokens. In this first of three posts on Fernet tokens, I’d like to go over the definition of OpenStack tokens, the different types and why Fernet tokens should matter to you. This series will conclude with some awesome examples of how to use Red Hat Ansible to manage your Fernet token keys in production.

    First, some definitions …

    What is a token? OpenStack tokens are bearer tokens, used to authenticate and validate users and processes in your OpenStack environment. Pretty much any time anything happens in OpenStack, a token is involved. The OpenStack Keystone service is the core service that issues and validates tokens. Users and software clients authenticate via APIs, receive a token, and then use that token when requesting operations ranging from creating compute resources to allocating storage. Services like Nova or Ceph then validate that token with Keystone and continue on with or deny the requested operation. The following diagram shows a simplified version of this dance.

    Diagram courtesy of the author

    Token Types

    Tokens come in several types, referred to as “token providers” in Keystone parlance. These types can be set at deployment time, or changed post deployment. Ultimately, you’ll have to decide what works best for your environment, given your organization’s workload in the cloud.

    The following types of tokens exist in Keystone:

    UUID (Universal Unique Identifier)

    The default token provider in Keystone is UUID. This is a 32-byte bearer token that must be persisted (stored) across controller nodes, along with their associated metadata, in order to be validated.
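    The format is easy to picture with Python's standard library; a UUID token is essentially a uuid4().hex value (a sketch of the format, not Keystone code):

```python
import uuid

# A UUID token is 32 hex characters of pure randomness; the string itself
# carries no user, project, or expiry data, which is why Keystone must
# persist it in its database to be able to validate it later.
token = uuid.uuid4().hex
print(len(token))  # 32
```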

    PKI & PKIZ (public key infrastructure)

    This token format is deprecated as of the OpenStack Ocata release, which means it is deprecated in Red Hat OpenStack Platform 11. This format is also persisted across controller nodes. PKI tokens contain catalog information of the user that bears them, and thus can get quite large, depending on how large your cloud is. PKIZ tokens are simply compressed versions of PKI tokens.


    Fernet

    Fernet tokens (pronounced fehr:NET) are message-packed tokens that contain authentication and authorization data. Fernet tokens are signed and encrypted before being handed out to users. Most importantly, however, Fernet tokens are ephemeral: they do not need to be persisted across clustered systems in order to be successfully validated.

    Fernet was originally a secure messaging format created by Heroku. The OpenStack implementation of this lightweight and more API-friendly format was developed by the OpenStack Keystone core team.
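    The primitive itself is easy to demonstrate with the Python cryptography package, which implements the same Fernet spec (a sketch of the concept only; Keystone's real payload is msgpack-encoded and managed through its own key repository):

```python
from cryptography.fernet import Fernet

# The key is what Keystone nodes share; the tokens themselves are never stored.
key = Fernet.generate_key()
f = Fernet(key)

# The token is signed and encrypted: anyone holding the key can validate
# it without a database lookup, which is why Fernet tokens are ephemeral.
token = f.encrypt(b"user-id|project-id|expiry")
assert f.decrypt(token) == b"user-id|project-id|expiry"
```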

    The Problem

    As you may have guessed by now, the real problem solved by Fernet tokens is one of persistence. Imagine, if you will, the following scenario:

    1. A user logs into Horizon (the OpenStack Dashboard)
    2. User creates a compute instance
    3. User requests persistent storage upon instance creation
    4. User assigns a floating IP to the instance

    While this is a simplified scenario, you can clearly see that there are multiple calls to different core components being made. In even the most basic of examples you see at least one authentication, as well as multiple validations along the way. Not only does this require network bandwidth, but when using persistent token providers such as UUID it also requires a lot of storage in Keystone. Additionally, the token table in the database used by Keystone grows as your cloud gets more usage. When using UUID tokens, operators must implement a detailed and comprehensive strategy to prune this table at periodic intervals to avoid real trouble down the line. This becomes even more difficult in a clustered environment.

    Photo by Eugenio Mazzone on Unsplash

    It’s not only backend components which are affected. In fact, all services that are exposed to users require authentication and authorization. This leads to increased bandwidth and storage usage on one of the most critical core components in OpenStack. If Keystone goes down, your users will know it and you no longer have a cloud in any sense of the word.

    Now imagine the impact as you scale your cloud; the problems with UUID tokens are dangerously amplified.

    Benefits of Fernet tokens

    Because Fernet tokens are ephemeral, you have the following immediate benefits:

    • Tokens do not need to be replicated to other instances of Keystone in your controller cluster
    • Storage is not affected, as these tokens are not stored

    The end-result offers increased performance overall. This was the design imperative of Fernet tokens, and the OpenStack community has more than delivered.  

    Show me the numbers

    All of these benefits sound good, but what are the real numbers behind the performance differences between UUID and Fernet? One of the core Keystone developers, Dolph Mathews, created a great post about Fernet benchmarks.

    Note that these benchmarks are for OpenStack Kilo, so you’ll most likely see even greater performance numbers in newer releases.

    The most important benchmarks in Dolph’s post are the ones comparing the various token formats to each other on a globally-distributed Galera cluster. These show the following results using UUID as a baseline:

    Token creation performance

    Fernet: 50.8 ms (85% faster than UUID), 237.1 (42% faster than UUID)

    Token validation performance

    Fernet: 5.55 ms (8% faster than UUID), 1957.8 (14% faster than UUID)

    As you can see, these numbers are quite remarkable. More informal benchmarks can be found at the CERN OpenStack blog, OpenStack in Production.

    Security Implications

    Photo by Praveesh Palakeel on Unsplash

    One important aspect of using Fernet tokens is security. As these tokens are signed and encrypted, they are inherently more secure than plain text UUID tokens. One really great aspect of this is the fact that you can invalidate a large number of tokens, either during normal operations or during a security incident, by simply changing the keys used to validate them. This requires a key rotation strategy, which I’ll get into in the third part of this series.
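    The "invalidate by changing keys" property can be sketched with the same cryptography package, whose MultiFernet type mirrors how a validator holds several keys at once (an illustration of the concept, not Keystone's key-repository mechanics):

```python
from cryptography.fernet import Fernet, InvalidToken, MultiFernet

old_key = Fernet(Fernet.generate_key())
new_key = Fernet(Fernet.generate_key())

# During rotation the validator holds both keys, newest first, so tokens
# issued under the old key still validate.
f = MultiFernet([new_key, old_key])
token = old_key.encrypt(b"payload")
assert f.decrypt(token) == b"payload"

# Dropping the old key instantly invalidates every token it issued.
try:
    MultiFernet([new_key]).decrypt(token)
except InvalidToken:
    print("old tokens rejected")
```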

    While there are security advantages to Fernet tokens, it must be said they are only as secure as the keys that created them. Keystone creates the tokens with a set of keys in your Red Hat OpenStack Platform environment. Using advanced technologies like SELinux, Red Hat Enterprise Linux is a trusted partner in this equation. Remember, the OS matters.


    While OpenStack functions just fine with its default UUID token format, I hope that this article shows you some of the benefits of Fernet tokens. I also hope that you find the knowledge you’ve gained here to be useful, once you decide to move forward to implementing them.

    In our follow-up blog post in this series, we’ll be looking at how to enable Fernet tokens in your OpenStack environment — both pre and post-deploy. Finally, our last post will show you how to automate key rotation using Red Hat Ansible in a production environment. I hope you’ll join me along the way.

    by Ken Savich, Senior OpenStack Solution Architect at December 07, 2017 07:06 PM

    Daniel Berrange

    Full coverage of libvirt XML schemas achieved in libvirt-go-xml

    In recent times I have been aggressively working to expand the coverage of libvirt XML schemas in the libvirt-go-xml project. Today this work has finally come to a conclusion, when I achieved what I believe to be effectively 100% coverage of all of the libvirt XML schemas. More on this later, but first some background on Go and XML….

    For those who aren’t familiar with Go, the core library’s encoding/xml module provides a very easy way to consume and produce XML documents in Go code. You simply define a set of struct types and annotate their fields to indicate what elements & attributes each should map to. For example, given the Go structs:

    type Person struct {
        XMLName xml.Name `xml:"person"`
        Name    string   `xml:"name,attr"`
        Age     string   `xml:"age,attr"`
        Home    *Address `xml:"home"`
        Office  *Address `xml:"office"`
    }

    type Address struct {
        Street string `xml:"street"`
        City   string `xml:"city"`
    }

    You can parse/format XML documents looking like

    <person name="Joe Blogs" age="24">
      <home>
        <street>Some where</street><city>London</city>
      </home>
      <office>
        <street>Some where else</street><city>London</city>
      </office>
    </person>

    Other programming languages I’ve used required a great deal more work when dealing with XML. For parsing, there’s typically a choice between an XML stream based parser where you have to react to tokens as they’re parsed and stuff them into structs, or a DOM object hierarchy from which you then have to pull data out into your structs. For outputting XML, apps either build up a DOM object hierarchy again, or dynamically format the XML document incrementally. Whichever approach is taken, it generally involves writing a lot of tedious & error-prone boilerplate code. In most cases, the Go encoding/xml module eliminates all the boilerplate code, only requiring the data type definitions. This really makes dealing with XML a much more enjoyable experience, because you effectively don’t deal with XML at all! There are some exceptions to this though, as the simple annotations can’t capture every nuance of many XML documents. For example, integer values are always parsed & formatted in base 10, so extra work is needed for base 16. There’s also no concept of unions in Go or in the XML annotations. In these edge cases custom marshalling/unmarshalling methods need to be written. BTW, this approach to XML is also taken for other serialization formats including JSON and YAML too, with one struct field able to have many annotations so it can be serialized to a range of formats.

    Back to the point of the blog post, when I first started writing Go code using libvirt it was immediately obvious that everyone using libvirt from Go would end up re-inventing the wheel for XML handling. Thus about 1 year ago, I created the libvirt-go-xml project whose goal is to define a set of structs that can handle documents in every libvirt public XML schema. Initially the level of coverage was fairly light, and over the past year 18 different contributors have sent patches to expand the XML coverage in areas that their respective applications touched. It was clear, however, that taking an incremental approach would mean that libvirt-go-xml is forever trailing what libvirt itself supports. It needed an aggressive push to achieve 100% coverage of the XML schemas, or as near as practically identifiable.

    Alongside each set of structs we had also been writing unit tests with a set of structs populated with data, and a corresponding expected XML document. The idea for writing the tests was that the author would copy a snippet of XML from a known good source, and then populate the structs that would generate this XML. In retrospect this was not a scalable approach, because there is an enormous range of XML documents that libvirt supports. A further complexity is that Go doesn’t generate XML documents in the exact same manner as libvirt. For example, it never generates self-closing tags, instead always outputting a full opening & closing pair. This is semantically equivalent, but makes a plain string comparison of two XML documents impractical in the general case.

    Considering the need to expand the XML coverage, and provide a more scalable testing approach, I decided to change approach. The libvirt.git tests/ directory currently contains 2739 XML documents that are used to validate libvirt’s own native XML parsing & formatting code. There is no better data set to use for validating the libvirt-go-xml coverage than this. Thus I decided to apply a round-trip testing methodology. The libvirt-go-xml code would be used to parse the sample XML document from libvirt.git, and then immediately serialize them back into a new XML document. Both the original and new XML documents would then be parsed generically to form a DOM hierarchy which can be compared for equivalence. Any place where documents differ would cause the test to fail and print details of where the problem is. For example:

    $ go test -tags xmlroundtrip
    --- FAIL: TestRoundTrip (1.01s)
    	xml_test.go:384: testdata/libvirt/tests/vircaps2xmldata/vircaps-aarch64-basic.xml: \
                /capabilities[0]/host[0]/topology[0]/cells[0]/cell[0]/pages[0]: \
                element in expected XML missing in actual XML

    This shows the filename that failed to correctly roundtrip, and the position within the XML tree that didn’t match. Here the NUMA cell topology has a ‘<pages>’ element expected but not present in the newly generated XML. Now it was simply a matter of running the roundtrip test over & over & over & over & over & over & over……….& over & over & over, adding structs / fields for each omission that the test identified.
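    The round-trip comparison idea is language-agnostic; here is a minimal sketch of the same methodology using Python's standard library (parse both documents, normalize them into plain data structures, and compare those rather than the raw strings):

```python
import xml.etree.ElementTree as ET

def dom_equal(a: str, b: str) -> bool:
    """Structurally compare two XML documents, ignoring formatting
    differences such as self-closing vs. paired tags."""
    def norm(e):
        return (e.tag, sorted(e.attrib.items()),
                (e.text or "").strip(), [norm(c) for c in e])
    return norm(ET.fromstring(a)) == norm(ET.fromstring(b))

# Semantically equivalent even though the serializations differ:
assert dom_equal("<a x='1'><b/></a>", '<a x="1"><b></b></a>')
```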

    After doing this for some time, libvirt-go-xml now has 586 structs defined containing 1816 fields, and has certified 100% coverage of all libvirt public XML schemas. Of course when I say 100% coverage, this is probably a lie, as I’m blindly assuming that the libvirt.git test suite has 100% coverage of all its own XML schemas. This is certainly a goal, but I’m confident there are cases where libvirt itself is missing test coverage. So if any omissions are identified in libvirt-go-xml, these are likely omissions in libvirt’s own testing.

    On top of this, the XML roundtrip test is set to run in the libvirt Jenkins and Travis CI systems, so as libvirt extends its XML schemas, we’ll get build failures in libvirt-go-xml and thus know to add support there to keep up.

    In expanding the coverage of XML schemas, a number of non-trivial changes were made to existing structs defined by libvirt-go-xml. These were mostly in places where we have to handle a union concept defined by libvirt. Typically with libvirt an element will have a “type” attribute, whose value then determines what child elements are permitted. Previously we had been defining a single struct, whose fields represented all possible children across all the permitted type values. This did not scale well and gave the developer no clue what content is valid for each type value. In the new approach, for each distinct type attribute value, we now define a distinct Go struct to hold the contents. This will cause API breakage for apps already using libvirt-go-xml, but on balance it is worth it to get a better structure over the long term. There were also cases where a child XML element previously represented a single value and this was mapped to a scalar struct field. Libvirt then added one or more attributes on this element, meaning the scalar struct field had to turn into a struct field that points to another struct. There is no nice way to avoid these kinds of changes, so while we endeavour not to gratuitously change current structs, if the libvirt XML schema gains new content, it might trigger further changes in the libvirt-go-xml structs that are not 100% backwards compatible.

    Since we are now tracking libvirt.git XML schemas, going forward we’ll probably add tags in the libvirt-go-xml repo that correspond to each libvirt release. So for app developers we’ll encourage use of Go vendoring to pull in a precise version of libvirt-go-xml instead of blindly tracking master all the time.

    by Daniel Berrange at December 07, 2017 02:14 PM

    December 01, 2017

    Daniel Berrange

    Full colour emojis in virtual machine names in Fedora 27

    Quite by chance today I discovered that Fedora 27 can display full colour glyphs for unicode characters that correspond to emojis, when the terminal displaying my mutt mail reader displayed someone’s name with a full colour glyph showing stars:

    Mutt in GNOME terminal rendering color emojis in sender name

    Chatting with David Gilbert on IRC I learnt that this is a new feature in Fedora 27 GNOME, thanks to recent work in the GTK/Pango stack. David then pointed out this works in libvirt, so I thought I would illustrate it.

    Virtual machine name with full colour emojis rendered

    No special hacks were required to do this; I simply entered the emojis as the virtual machine name when creating it from virt-manager’s wizard:

    Virtual machine name with full colour emojis rendered

    As mentioned previously, GNOME terminal displays colour emojis, so these virtual machine names appear nicely when using virsh and other command line tools

    Virtual machine name rendered with full colour emojis in terminal commands

    The more observant readers will notice that the command line args have a bug: the snowman in the machine name is incorrectly rendered in the process listing. The actual data in /proc/$PID/cmdline is correct, so something about the “ps” command appears to be mangling it prior to output. It isn’t simply a font problem, because other commands besides “ps” render it properly, and if you grep the “ps” output for the snowman emoji no results are displayed.

    by Daniel Berrange at December 01, 2017 01:28 PM

    November 30, 2017

    RDO Blog

    Open Source Summit, Prague

    In October, RDO had a small presence at the Open Source Summit (formerly known as LinuxCon) in Prague, Czechia.

    While this event does not traditionally draw a big OpenStack audience, we were treated to a great talk by Monty Taylor on Zuul, and Fatih Degirmenci gave an interesting talk on cross-community CI, in which he discussed the joint work between the OpenStack and OpenDaylight communities to help one another verify cross-project functionality.

    On one of the evenings, members of the Fedora and CentOS community met in a BoF (Birds of a Feather) meeting, to discuss how the projects relate, and how some of the load - including the CI work that RDO does in the CentOS infrastructure - can better be shared between the two projects to reduce duplication of effort.

    This event is always a great place to interact with other open source enthusiasts. While, in the past, it was very Linux-centric, the event this year had a rather broader scope, and so drew people from many more communities.

    Upcoming Open Source Summits will be held in Japan (June 20-22, 2018), Vancouver (August 29-31, 2018) and Edinburgh (October 22-24, 2018), and we expect to have a presence of some kind at each of these events.

    by Rich Bowen at November 30, 2017 09:29 PM

    Upcoming changes to test day

    TL;DR: A live RDO cloud will be available for testing on the upcoming test day. Read on for more info.

    The last few test days have been somewhat lackluster, and have not had much participation. We think that there's a number of reasons for this:

    • Deploying OpenStack is hard and boring
    • Not everyone has the necessary hardware to do it anyways
    • Automated testing means that there's not much left for the humans to do

    In today's IRC meeting, we were brainstorming about ways to improve participation in test day.

    We think that, in addition to testing the new packages, it's a great way for you, the users, to see what's coming in future releases, so that you can start thinking about how you'll use this functionality.

    One idea that came out of it is to have a test cloud, running the latest packages, available to you during test day. You can get on there, poke around, break stuff, and help test it, without having to go through the pain of deploying OpenStack.

    David has written more about this on his blog.

    If you're interested in participating, please sign up.

    Please also give some thought to what kinds of test scenarios we should be running, and add those to the test page. Or, respond to this thread with suggestions of what we should be testing.

    Details about the upcoming test day may be found on the RDO website.


    by Rich Bowen at November 30, 2017 06:47 PM

    Getting started with Software Factory and Zuul3


    Software Factory 2.7 has recently been released. Software Factory is an easy-to-deploy software development forge that provides, among other features, code review and continuous integration (CI). This new release features Zuul V3, which is now the default CI component of Software Factory.

    In this blog post I will explain how to deploy a Software Factory instance for testing purposes in less than 30 minutes and initialize two demo repositories to be tested via Zuul.

    Note that Zuul V3 is not yet released upstream; however, it is already in production, acting as the CI system of OpenStack.


    Software Factory requires CentOS 7 as its base Operating System so the commands listed below should be executed on a fresh deployment of CentOS 7.

    The default FQDN of a Software Factory deployment must be added to your /etc/hosts, pointing at the IP address of your deployment, in order to be accessible in your browser.


    First, let's install the repository of the latest version, then install sf-config, the configuration management tool.

    sudo yum install -y
    sudo yum install -y sf-config

    Activating extra components

    Software Factory has a modular architecture that can be easily defined through a YAML configuration file, located in /etc/software-factory/arch.yaml. By default, only a limited set of components are activated to set up a minimal CI with Zuul V3.

    We will now add the hypervisor-oci component to configure a container provider, so that OCI containers can be consumed by Zuul when running CI jobs. In other words, you won't need an OpenStack cloud account to run your first Zuul V3 jobs with this Software Factory instance.

    Note that the OCI driver, on which hypervisor-oci relies, while totally functional, is still under review and not yet merged upstream.

    echo "      - hypervisor-oci" | sudo tee -a /etc/software-factory/arch.yaml

    Starting the services

    Finally run sf-config:

    sudo sfconfig --enable-insecure-slaves --provision-demo

    When the sfconfig command finishes, you should be able to access the Software Factory web UI in your browser. You should then be able to log in with the username admin and password userpass (click on "Toggle login form" to display the built-in authentication).

    Triggering a first job on Zuul

    The --provision-demo option is a special command that provisions two demo Git repositories on Gerrit, with two demo jobs.

    Let's propose a first change on it:

    sudo -i
    cd demo-project
    touch f1 && git add f1 && git commit -m"Add a test change" && git review

    Then you should see the jobs being executed on the ZuulV3 status page.

    Zuul buildset

    And get the jobs' results on the corresponding Gerrit review page.

    Gerrit change

    Finally, you should find the links to the generated artifacts and the ARA reports.

    ARA report

    Next steps to go further

    To learn more about Software Factory, please refer to the user documentation. You can reach the Software Factory team on the Freenode IRC channel #softwarefactory or by email via the mailing list.

    by fboucher at November 30, 2017 06:47 PM

    November 29, 2017

    RDO Blog

    A summary of Sydney OpenStack Summit docs sessions

    Here I'd like to give a summary of the Sydney OpenStack Summit docs sessions that I took part in, and share my comments on them with the broader OpenStack community.

    Docs project update

    At this session, we discussed a recent major refocus of the Documentation project work and restructuring of the OpenStack official documentation. This included migrating documentation from the core docs suite to project teams who now own most of the content.

    We also covered the most important updates from the Documentation planning sessions held at the Denver Project Teams Gathering, including our new retention policy for End-of-Life documentation, which is now being implemented.

    This session was recorded, you can watch the recording here:

    Docs/i18n project onboarding

    This was a session jointly organized with the i18n community. Alex Eng, Stephen Finucane, and yours truly gave three short presentations on translating OpenStack, OpenStack + Sphinx in a tree, and introduction to the docs community, respectively.

    As it turned out, the session was not attended by newcomers to the community; instead, community members from various teams and groups joined us for the onboarding. This made it a bit more difficult to find the proper focus for the session and to accommodate the different needs and expectations of those in the audience. Definitely something to think about for the next Summit.

    Installation guides updates and testing

    I held this session to gather the community's views on the future of the installation guides and on testing of installation procedures.

    The feedback received was mostly focused on three points:

    • A better feedback mechanism for new users, who are the main audience here. One idea is to bring back comments at the bottom of install guide pages.

    • To help users better understand the processes described in instructions and the overall picture, provide more references to conceptual or background information.

    • Generate content from install shell scripts, to help with verification and testing.

    The session etherpad with more details can be found here:

    Ops guide transition and maintenance

    This session was organized by Erik McCormick from the OpenStack Operators community. There is an ongoing effort driven by the Ops community to migrate retired OpenStack Ops docs over to the OpenStack wiki, for easy editing.

    We mostly discussed a number of challenges related to maintaining the technical content in wiki, and how to make more vendors interested in the effort.

    The session etherpad can be found here:

    Documentation and relnotes, what do you miss?

    This session was run by Sylvain Bauza and the focus of the discussion was on identifying gaps in content coverage found after the documentation migration.

    Again, Ops-focused docs turned out to be a hot topic, as did providing more detailed conceptual information together with the procedural content, and the structuring of release notes. We should also seriously consider (semi-)automating checks for broken links.

    You can read more about the discussion points here:

    by Petr Kovar at November 29, 2017 07:25 PM

    November 28, 2017

    RDO Blog

    Anomaly Detection in CI logs

    Continuous Integration jobs can generate a lot of data, and it can take a lot of time to figure out what went wrong when a job fails. This article demonstrates new strategies to assist with failure investigations and to reduce the need to crawl boring log files.

    First, I will introduce the challenge of anomaly detection in CI logs. Second, I will present a workflow to automatically extract and report anomalies using a tool called LogReduce. Lastly, I will discuss the current limitations and how more advanced techniques could be used.


    Finding anomalies in CI logs using simple patterns such as "grep -i error" is not enough, because interesting log lines don't necessarily feature obvious anomalous messages such as "error" or "failed". Sometimes you don't even know what you are looking for.

    In comparison to regular logs, such as the system logs of a production service, CI logs have a very interesting characteristic: they are reproducible. Thus, it is possible to carefully look for new events that are not present in the logs of other job executions. This article focuses on this particular characteristic to detect anomalies.

    The challenge

    For this article, baseline events are defined as the collection of log lines produced by nominal job executions, and target events are defined as the collection of log lines produced by a failed job run.

    Searching for anomalous events is challenging because:

    • Events can be noisy: they often include unique features such as timestamps, hostnames or UUIDs.
    • Events can be scattered across many different files.
    • False positive events may appear for various reasons, for example when a new test option has been introduced. However, they often share a common semantic with some baseline events.

    Moreover, there can be a very high number of events, for example more than 1 million lines for TripleO jobs. Thus, we cannot simply look up each target event in the baseline events.
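A toy example illustrates why a naive line-by-line set difference is not enough (the log lines below are made up, for illustration only):

```python
# Naive approach: report every target line absent from the baseline.
baseline = [
    "2017-11-01 10:00:01 INFO req-aaa1 Starting service",
    "2017-11-01 10:00:02 INFO req-aaa2 Service started",
]
target = [
    "2017-11-02 09:12:44 INFO req-bbb1 Starting service",   # benign
    "2017-11-02 09:12:45 ERROR req-bbb2 Disk full",         # the real anomaly
]
new_lines = [line for line in target if line not in set(baseline)]
# Timestamps and request ids make every line unique, so the benign
# line gets reported alongside the real anomaly.
print(len(new_lines))  # → 2
```

Both target lines are flagged as "new", even though only one of them matters; this is the noise problem the tokenization step below addresses.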

    OpenStack Infra CRM114

    It is worth noting that anomaly detection is already happening live in the openstack-infra operated review system using classify-log.crm, which is based on CRM114 Bayesian filters.

    However, it is currently only used to classify global failures in the context of the elastic-recheck process. The main drawbacks of using this tool are:

    • Events are processed per word, without considering complete lines: it only computes the distances of up to a few words.
    • Reports are hard to find for regular users; they would have to go to the elastic-recheck uncategorized page and click the crm114 links.
    • It is written in an obscure language.


    This part presents the techniques I used in LogReduce to overcome the challenges described above.

    Reduce noise with tokenization

    The first step is to reduce the complexity of the events to simplify further processing. Here is the line processor I used, see the Tokenizer module:

    • Skip known bogus events such as ssh scan: "sshd.+[iI]nvalid user"
    • Remove known words:
      • Hashes, which are hexadecimal words that are 32, 64 or 128 characters long
      • UUID4
      • Date names
      • Random prefixes such as (tmp|req-|qdhcp-)[^\s\/]+
    • Discard every character that is not [a-z_\/]

    For example this line:

      2017-06-21 04:37:45,827 INFO [nodepool.builder.UploadWorker.0] Uploading DIB image build 0000000002 from /tmpxvLOTg/fake-image-0000000002.qcow2 to fake-provider

    Is reduced to:

      INFO nodepool builder UploadWorker Uploading image build from /fake image fake provider
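A minimal sketch of such a line processor in Python. This is a simplified stand-in for the actual Tokenizer module: the exact patterns, the short-word filter, and the omission of date-name removal are assumptions of this sketch.

```python
import re

BOGUS = re.compile(r"sshd.+[iI]nvalid user")
RANDOM_PREFIX = re.compile(r"(tmp|req-|qdhcp-)[^\s/]+")
HEXA = re.compile(r"\b[0-9a-fA-F]{32}(?:[0-9a-fA-F]{32})?(?:[0-9a-fA-F]{64})?\b")
UUID4 = re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-4[0-9a-fA-F]{3}-"
                   r"[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b")

def tokenize(line):
    # Skip known bogus events such as ssh scans
    if BOGUS.search(line):
        return ""
    # Remove known noisy words: UUIDs, hashes, random prefixes
    for pattern in (UUID4, HEXA, RANDOM_PREFIX):
        line = pattern.sub("", line)
    # Discard every character that is not alphabetic, '_' or '/'
    words = re.sub(r"[^a-zA-Z_/ ]", " ", line).split()
    # Keep only words long enough to carry meaning
    return " ".join(w for w in words if len(w) > 2)

line = ("2017-06-21 04:37:45,827 INFO [nodepool.builder.UploadWorker.0] "
        "Uploading DIB image build 0000000002 "
        "from /tmpxvLOTg/fake-image-0000000002.qcow2 to fake-provider")
print(tokenize(line))
```

Timestamps, build numbers and the random /tmpxvLOTg prefix are all stripped, leaving a stable set of words for the vectorizer.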

    Index events in a NearestNeighbors model

    The next step is to index the baseline events. I used a NearestNeighbors model to query the distance of target events from baseline events. This helps remove false-positive events that are similar to known baseline events. The model is fitted with all the baseline events transformed using Term Frequency Inverse Document Frequency (tf-idf). See the SimpleNeighbors model.

    vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(
        analyzer='word', lowercase=False, tokenizer=None,
        preprocessor=None, stop_words=None)
    train_vectors = vectorizer.fit_transform(train_data)
    nn = sklearn.neighbors.NearestNeighbors(n_neighbors=1)
    nn.fit(train_vectors)

    Instead of having a single model per job, I built a model per file type. This requires some pre-processing work to figure out which model to use for each file. File names are converted to model names using another tokenization process to group similar files. See the filename2modelname function.

    For example, the following files are grouped like so:

    audit.clf: audit/audit.log audit/audit.log.1
    merger.clf: zuul/merger.log zuul/merge.log.2017-11-12
    journal.clf: undercloud/var/log/journal.log overcloud/var/log/journal.log
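The grouping idea can be sketched as follows. This is a simplified, hypothetical version; the real filename2modelname function handles more cases, such as grouping merger.log together with merge.log.2017-11-12.

```python
import os
import re

def filename_to_modelname(path):
    # Group log files by stripping rotation suffixes, extensions and digits.
    name = os.path.basename(path)
    name = re.sub(r"\.[0-9][0-9.-]*$", "", name)      # rotation: .1, .2017-11-12
    name = re.sub(r"\.(log|txt)(\.gz)?$", "", name)   # common extensions
    name = re.sub(r"[0-9]+", "", name)                # remaining digits
    return name + ".clf"

print(filename_to_modelname("audit/audit.log"))       # → audit.clf
print(filename_to_modelname("audit/audit.log.1"))     # → audit.clf
print(filename_to_modelname("undercloud/var/log/journal.log"))  # → journal.clf
```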

    Detect anomalies based on kneighbors distance

    Once the NearestNeighbors model is fitted with the baseline events, we can repeat the tokenization and tf-idf transformation for the target events. Then, using the kneighbors query, we compute the distance of each target event.

    test_vectors = vectorizer.transform(test_data)
    distances, _ = nn.kneighbors(test_vectors, n_neighbors=1)

    Using a distance threshold, this technique can effectively detect anomalies in CI logs.
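Putting the pieces together on toy data (a self-contained sketch: the event strings here are made up, and 0.2 matches the threshold used later in the server mode):

```python
import sklearn.feature_extraction.text
import sklearn.neighbors

train_data = [  # baseline events from successful runs
    "INFO nodepool builder Uploading image build",
    "INFO zuul merger Updating repository",
]
test_data = [   # target events from the failed run
    "INFO nodepool builder Uploading image build",  # known event
    "ERROR nodepool builder Image build failed",    # anomaly
]

vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(
    analyzer='word', lowercase=False)
nn = sklearn.neighbors.NearestNeighbors(n_neighbors=1)
nn.fit(vectorizer.fit_transform(train_data))

# Distance of each target event to its nearest baseline event
distances, _ = nn.kneighbors(vectorizer.transform(test_data), n_neighbors=1)
anomalies = [line for line, dist in zip(test_data, distances[:, 0])
             if dist > 0.2]
print(anomalies)  # → only the ERROR line
```

The known event has distance 0 and is filtered out; the novel ERROR line sits far from every baseline vector and is reported.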

    Automatic process

    Instead of manually running the tool, I added a server mode that automatically searches and reports anomalies found in failed CI jobs. Here are the different components:

    • listener connects to mqtt/gerrit event-stream/ and collects all successful and failed jobs.

    • worker processes jobs collected by the listener. For each failed job, it does the following (in pseudo-code):

    Build model if it doesn't exist or if it is too old:
    	For each last 5 success jobs (baseline):
    		Fetch logs
    	For each baseline file group:
    		Tokenize lines
    		TF-IDF fit_transform
    		Fit file group model
    Fetch target logs
    For each target file:
    	Look for the file group model
    	Tokenize lines
    	TF-IDF transform
    	file group model kneighbors search
    	yield lines that have distance > 0.2
    Write report
    • publisher processes each report computed by the worker and notifies:
      • IRC channel
      • Review comment
      • Mail alert (e.g. for a periodic job which doesn't have an associated review)

    Reports example

    Here are a couple of examples to illustrate LogReduce reporting.

    In this change I broke a service configuration (zuul gerrit port), and logreduce correctly found the anomaly in the service logs (zuul-scheduler can't connect to gerrit): sf-ci-functional-minimal report

    In this tripleo-ci-centos-7-scenario001-multinode-oooq-container report, logreduce found 572 anomalies out of 1,078,248 lines. The interesting ones are:

    • Non-obvious new DEBUG statements in /var/log/containers/neutron/neutron-openvswitch-agent.log.txt.
    • New setting of the firewall_driver=openvswitch in neutron was detected in:
      • /var/log/config-data/neutron/etc/neutron/plugins/ml2/ml2_conf.ini.txt
      • /var/log/extra/docker/docker_allinfo.log.txt
    • New usage of cinder-backup was detected across several files such as:
      • /var/log/journal contains new puppet statement
      • /var/log/cluster/corosync.log.txt
      • /var/log/pacemaker/bundles/rabbitmq-bundle-0/rabbitmq/rabbit@centos-7-rax-iad-0000787869.log.txt.gz
      • /etc/puppet/hieradata/service_names.json
      • /etc/sensu/conf.d/client.json.txt
      • pip2-freeze.txt
      • rpm-qa.txt

    Caveats and improvements

    This part discusses the caveats and limitations of the current implementation and suggests other improvements.

    Empty success logs

    This method doesn't work when the debug events are only included in the failed logs. To successfully detect anomalies, failure and success logs need to be similar, otherwise all the extra information in failed logs will be considered anomalous.

    This situation happens with testr results where success logs only contain 'SUCCESS'.

    Building good baseline model

    Building a good baseline model with nominal job events is key to anomaly detection. We could use periodic execution (with or without failed runs), or the gate pipeline.

    Unfortunately, Zuul currently lacks build reporting, and we have to scrape Gerrit comments or status web pages, which is sub-optimal. Hopefully the upcoming zuul-web builds API and zuul-scheduler MQTT reporter will make this task easier to implement.

    Machine learning

    I am by no means proficient at machine learning. Logreduce happens to be useful as it is now. However, here are some other strategies that may be worth investigating.

    The model currently uses a word dictionary to build the feature vectors, and this may be improved by using different feature extraction techniques more suited to log line events, such as MinHash and/or Locality Sensitive Hashing.

    The NearestNeighbors kneighbors query tends to be slow for large samples, and this may be improved by using a Self-Organizing Map, RandomForest or OneClassSVM model.

    When line sizes are not homogeneous in a file group, the model doesn't work well. For example, mistral/api.log line sizes vary between 10 and 8000 characters. Using one model per bin, based on line size, may be a great improvement.
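The binning idea could be sketched like this (a hypothetical scheme, not part of LogReduce: bucket lines by the power of two of their length and train one model per bucket):

```python
from collections import defaultdict

def size_bin(line):
    # Power-of-two bucket for the line length: 10-char and 8000-char
    # lines land in different buckets.
    return max(len(line) - 1, 1).bit_length()

def group_by_bin(lines):
    bins = defaultdict(list)
    for line in lines:
        bins[size_bin(line)].append(line)
    return bins

short = "GET /v2/actions"   # a short API log line
long = "x" * 8000           # a huge traceback-like line
print(size_bin(short), size_bin(long))  # different buckets
```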

    CI logs analysis is a broad subject on its own, and I suspect someone good at machine learning might be able to find other clever anomaly detection strategies.

    Further processing

    Detected anomalies could be further processed by:

    • Merging similar anomalies discovered across different files.
    • Looking for known anomalies in a system like elastic-recheck.
    • Reporting new anomalies to elastic-recheck so that affected jobs could be grouped.


    CI log analysis is a powerful service to assist failure investigations. The end goal would be to report anomalies instead of exhaustive job logs.

    Early results of LogReduce models look promising, and I hope we can set up such services for any CI jobs in the future. Please get in touch by mail or IRC (tristanC on Freenode) if you are interested.

    by tristanC at November 28, 2017 06:13 AM

    November 27, 2017

    RDO Blog

    Emilien Macchi talks TripleO at OpenStack Summit

    While at OpenStack Summit, I had an opportunity to talk with Emilien Macchi about the work on TripleO in the Pike and Queens projects.

    by Rich Bowen at November 27, 2017 09:39 PM

    OpenStack 3rd Party CI with Software Factory


    When developing for an OpenStack project, one of the most important aspects to cover is to ensure proper CI coverage of our code. Each OpenStack project runs a number of CI jobs on each commit to test its validity, so thousands of jobs are run every day in the upstream infrastructure.

    In some cases, we will want to set up an external CI system, and make it report as a 3rd Party CI on certain OpenStack projects. This may be because we want to cover specific software/hardware combinations that are not available in the upstream infrastructure, or want to extend test coverage beyond what is feasible upstream, or any other reason you can think of.

    While the process to set up a 3rd Party CI is documented, some implementation details are missing. In the RDO Community, we have been using Software Factory to power our 3rd Party CI for OpenStack, and it has worked very reliably over several release cycles.

    The main advantage of Software Factory is that it integrates all the pieces of the OpenStack CI infrastructure in an easy to consume package, so let's have a look at how to build a 3rd party CI from the ground up.


    You will need the following:

    • An OpenStack-based cloud, which will be used by Nodepool to create temporary VMs where the CI jobs will run. It is important to make sure that the default security group in the tenant accepts SSH connections from the Software Factory instance.
    • A CentOS 7 system for the Software Factory instance, with at least 8 GB of RAM and 80 GB of disk. It can run on the OpenStack cloud used for nodepool, just make sure it is running on a separate project.
    • DNS resolution for the Software Factory system.
    • A 3rd Party CI user on the OpenStack Gerrit. Follow this guide to configure it.
    • Some previous knowledge on how Gerrit and Zuul work is advisable, as it will help during the configuration process.

    Basic Software Factory installation

    For a detailed installation walkthrough, refer to the Software Factory documentation. We will highlight here how we set it up on a test VM.

    Software installation

    On the CentOS 7 instance, run the following commands to install the latest release of Software Factory (2.6 at the time of this article):

    $ sudo yum install -y
    $ sudo yum update -y
    $ sudo yum install -y sf-config

    Define the architecture

    Software Factory has several optional components, and can be set up to run them on more than one system. In our setup, we will install the minimum required components for a 3rd party CI system, all in one.

    $ sudo vi /etc/software-factory/arch.yaml

    Make sure the nodepool-builder role is included. Our file will look like:

    description: "OpenStack 3rd Party CI deployment"
    inventory:
      - name: managesf
        roles:
          - install-server
          - mysql
          - gateway
          - cauth
          - managesf
          - gitweb
          - gerrit
          - logserver
          - zuul-server
          - zuul-launcher
          - zuul-merger
          - nodepool-launcher
          - nodepool-builder
          - jenkins

    In this setup, we are using Jenkins to run our jobs, so we need to create an additional file:

    $ sudo vi /etc/software-factory/custom-vars.yaml

    And add the following content

    nodepool_zuul_launcher_target: False

    Note: As an alternative, we could use zuul-launcher to run our jobs and drop Jenkins. In that case, there is no need to create this file. However, later when defining our jobs we will need to use the jobs-zuul directory instead of jobs in the config repo.

    Edit Software Factory configuration

    $ sudo vi /etc/software-factory/sfconfig.yaml

    This file contains all the configuration data used by the sfconfig script. Make sure you set the following values:

    • Password for the default admin user.
      admin_password: supersecurepassword
    • The fully qualified domain name for your system.
    • The OpenStack cloud configuration required by Nodepool.
      - auth_url:
        name: microservers
        password: cloudsecurepassword
        project_name: mytestci
        region_name: RegionOne
        regions: []
        username: ciuser
    • The authentication options if you want other users to be able to log into your instance of Software Factory using OAuth providers like GitHub. This is not mandatory for a 3rd party CI. See this part of the documentation for details.

    • If you want to use LetsEncrypt to get a proper SSL certificate, set:

      use_letsencrypt: true

    Run the configuration script

    You are now ready to complete the configuration and get your basic Software Factory installation running.

    $ sudo sfconfig

    After the script finishes, just point your browser to your deployment's FQDN over https, and you will see the Software Factory interface.

    SF interface

    Configure SF to connect to the OpenStack Gerrit

    Once we have a basic Software Factory environment running, and our service account set up on the OpenStack Gerrit, we just need to connect the two together. The process is quite simple:

    • First, make sure the local Zuul user's public SSH key, found under /var/lib/zuul/.ssh/, is added to the service account on the OpenStack Gerrit.

    • Then, edit /etc/software-factory/sfconfig.yaml again, and edit the zuul section to look like:

      zuul:
        default_log_site: sflogs
        external_logservers: []
        gerrit_connections:
          - name: openstack
            port: 29418
            username: mythirdpartyciuser
    • Finally, run sfconfig again. Log information will start flowing in /var/log/zuul/server.log, and you will see a connection to port 29418.

    Create a test job

    In Software Factory 2.6, a special project named config is automatically created on the internal Gerrit instance. This project holds the user-defined configuration, and changes to the project must go through Gerrit.

    Configure images for nodepool

    All CI jobs will use a predefined image, created by Nodepool. Before creating any CI job, we need to prepare this image.

    • As a first step, add your SSH public key to the admin user in your Software Factory Gerrit instance.

    Add SSH Key

    • Then, clone the config repo on your computer and edit the nodepool configuration file:
    $ git clone ssh:// sf-config
    $ cd sf-config
    $ vi nodepool/nodepool.yaml
    • Define the disk image and assign it to the OpenStack cloud defined previously:
    diskimages:
      - name: dib-centos-7
        elements:
          - centos-minimal
          - nodepool-minimal
          - simple-init
          - sf-jenkins-worker
          - sf-zuul-worker
        env-vars:
          DIB_CHECKSUM: '1'
          QEMU_IMG_OPTIONS: compat=0.10
          DIB_GRUB_TIMEOUT: '0'

    labels:
      - name: dib-centos-7
        image: dib-centos-7
        min-ready: 1
        providers:
          - name: microservers

    providers:
      - name: microservers
        cloud: microservers
        clean-floating-ips: true
        image-type: raw
        max-servers: 10
        boot-timeout: 120
        pool: public
        rate: 2.0
        networks:
          - name: private
        images:
          - name: dib-centos-7
            diskimage: dib-centos-7
            username: jenkins
            min-ram: 1024
            name-filter: m1.medium

    First, we are defining the diskimage-builder elements that will create our image, named dib-centos-7.

    Then, we are assigning that image to our microservers cloud provider, and specifying that we want to have at least 1 VM ready to use.

    Finally we define some specific parameters about how Nodepool will use our cloud provider: the internal (private) and external (public) networks, the flavor for the virtual machines to create (m1.medium), how many seconds to wait between operations (2.0 seconds), etc.

    • Now we can submit the change for review:
    $ git add nodepool/nodepool.yaml
    $ git commit -m "Nodepool configuration"
    $ git review
    • In the Software Factory Gerrit interface, we can then check the open change. The config repo has some predefined CI jobs, so you can check if your syntax was correct. Once the CI jobs show a Verified +1 vote, you can approve it (Code Review +2, Workflow +1), and the change will be merged in the repository.

    • After the change is merged in the repository, you can check the logs at /var/log/nodepool and see the image being created, then uploaded to your OpenStack cloud.

    Define test job

    There is a special project in OpenStack meant to be used to test 3rd Party CIs, openstack-dev/ci-sandbox. We will now define a CI job to "check" any new commit being reviewed there.

    • Assign the nodepool image to the test job
    $ vi jobs/projects.yaml

    We are going to use a pre-installed job named demo-job. All we have to do is to ensure it uses the image we just created in Nodepool.

    - job:
        name: 'demo-job'
        defaults: global
        builders:
          - prepare-workspace
          - shell: |
              cd $ZUUL_PROJECT
              echo "This is a demo job"
        parameters:
          - zuul
        node: dib-centos-7
    • Define a Zuul pipeline and a job for the ci-sandbox project
    $ vi zuul/upstream.yaml

    We are creating a specific Zuul pipeline for changes coming from the OpenStack Gerrit, and specifying that we want to run a CI job for commits to the ci-sandbox project:

    pipelines:
      - name: openstack-check
        description: Newly uploaded patchsets enter this pipeline to receive an initial +/-1 Verified vote from Jenkins.
        manager: IndependentPipelineManager
        source: openstack
        precedence: normal
        require:
          open: True
          current-patchset: True
        trigger:
          openstack:
            - event: patchset-created
            - event: change-restored
            - event: comment-added
              comment: (?i)^(Patch Set [0-9]+:)?( [\w\\+-]*)*(\n\n)?\s*(recheck|reverify)
        success:
          openstack:
            verified: 0
        failure:
          openstack:
            verified: 0

    projects:
      - name: openstack-dev/ci-sandbox
        openstack-check:
          - demo-job

    Note that we are telling our job not to send a vote for now (verified: 0). We can change that later if we want to make our job voting.
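The comment-added trigger's regular expression decides which reviewer comments re-trigger the job; a quick way to sanity-check it (a standalone snippet, not part of the Software Factory configuration):

```python
import re

# The recheck/reverify pattern from the comment-added trigger above
RECHECK_RE = re.compile(
    r'(?i)^(Patch Set [0-9]+:)?( [\w\\+-]*)*(\n\n)?\s*(recheck|reverify)')

def is_recheck(comment):
    return RECHECK_RE.match(comment) is not None

print(is_recheck("recheck"))                   # → True
print(is_recheck("Patch Set 3:\n\nreverify"))  # → True
print(is_recheck("thanks, looks good"))        # → False
```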

    • Apply configuration change
    $ git add zuul/upstream.yaml jobs/projects.yaml
    $ git commit -m "Zuul configuration for 3rd Party CI"
    $ git review

    Once the change is merged, Software Factory's Zuul process will be listening for changes to the ci-sandbox project. Just try creating a change and see if everything works as expected!


    If something does not work as expected, here are some troubleshooting tips:

    Log files

    You can find the Zuul log files in /var/log/zuul. Zuul has several components, so start with checking server.log and launcher.log, the log files for the main server and the process that launches CI jobs.

    The Nodepool log files are located in /var/log/nodepool. builder.log contains the log from image builds, while nodepool.log has the log for the main process.

    Nodepool commands

    You can check the status of the virtual machines created by nodepool with:

    $ sudo nodepool list

    Also, you can check the status of the disk images with:

    $ sudo nodepool image-list

    Jenkins status

    You can see the Jenkins status from the GUI, at https:///jenkins/, when logged in as the admin user. If no machines show up in the 'Build Executor Status' pane, that means that either Nodepool could not launch a VM, or there was some issue in the connection between Zuul and Jenkins. In that case, check the Jenkins logs at `/var/log/jenkins`, or restart the service if there are errors.

    Next steps

    So far, we have only run a test job against a test project. The real power comes when you create a proper CI job for a project you are interested in. You should now:

    • Create a file under jobs/ with the JJB definition for your new job.

    • Edit zuul/upstream.yaml to add the project(s) you want your 3rd Party CI system to watch.

    by jpena at November 27, 2017 11:58 AM

    November 22, 2017

    Derek Higgins

    Booting baremetal from a Cinder Volume in TripleO

    Up until recently in TripleO, booting from a Cinder volume was confined to virtual instances, but now, thanks to some recent work in Ironic, baremetal instances can also be booted backed by a Cinder volume.

    Below I’ll go through the process of taking a CentOS cloud image, preparing it and loading it into a Cinder volume so that it can be used to back the root partition of a baremetal instance.

    First, I make a few assumptions:

    1. You have a working Ironic in a TripleO overcloud
      – if this isn’t something you’re familiar with, you’ll find some instructions here
      – if you can boot and ssh to a baremetal instance on the provisioning network, then you’re good to go
    2. You have a working Cinder in the TripleO overcloud, with enough storage to store the volumes
    3. I’ve tested TripleO (and OpenStack) using RDO as of 2017-11-14; earlier versions had at least one bug and won’t work


    Baremetal instances in the overcloud traditionally read config data from a config-drive via cloud-init. Config-drive isn’t supported by Ironic boot from volume, so we need to make sure that the metadata service is available. To do this, if your subnet isn’t already attached to one, you need to create a neutron router and attach it to the subnet you’ll be booting your baremetal instances on:

     $ neutron router-create r1
     $ neutron router-interface-add r1 provisioning-subnet

    Each node defined in Ironic that you would like to boot from volume needs to use the cinder storage interface, needs the iscsi_boot capability set, and requires a unique connector id (increment <NUM> for each node):

     $ openstack baremetal node set --property capabilities=iscsi_boot:true --storage-interface cinder <NODEID>
     $ openstack baremetal volume connector create --node <NODEID> --type iqn --connector-id <NUM>

    The last thing you’ll need is an image capable of booting from iSCSI. We’ll be starting with the CentOS cloud image, but need to alter it slightly so that it’s capable of booting over iSCSI.

    1. download the image

     $ curl > /tmp/CentOS-7-x86_64-GenericCloud.qcow2.xz
     $ unxz /tmp/CentOS-7-x86_64-GenericCloud.qcow2.xz

    2. mount it and change root into the image

     $ mkdir /tmp/mountpoint
     $ guestmount -i -a /tmp/CentOS-7-x86_64-GenericCloud.qcow2 /tmp/mountpoint
     $ chroot /tmp/mountpoint /bin/bash

    3. load the dracut iscsi module into the ramdisk

     chroot> mv /etc/resolv.conf /etc/resolv.conf_
     chroot> echo "nameserver" > /etc/resolv.conf
     chroot> yum install -y iscsi-initiator-utils
     chroot> mv /etc/resolv.conf_ /etc/resolv.conf
     # Be careful here to update the correct ramdisk (check /boot/grub2/grub.cfg)
     chroot> dracut --force --add "network iscsi" /boot/initramfs-3.10.0-693.5.2.el7.x86_64.img 3.10.0-693.5.2.el7.x86_64

    4. enable rd.iscsi.firmware so that dracut gets the iscsi target details from the firmware[1]

    The kernel must be booted with rd.iscsi.firmware=1 so that the iscsi target details are read from the firmware (passed to it by iPXE); this needs to be added to the grub config.

    In the chroot, edit the file /etc/default/grub and add rd.iscsi.firmware=1 to GRUB_CMDLINE_LINUX=…
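    The same edit can be scripted with sed instead of done by hand. A sketch, demonstrated here against a sample file with made-up contents (inside the chroot you would target /etc/default/grub itself):

```shell
# Create a sample /etc/default/grub to demonstrate the edit on.
cat > /tmp/grub.sample <<'EOF'
GRUB_TIMEOUT=1
GRUB_CMDLINE_LINUX="console=tty0 crashkernel=auto rhgb quiet"
EOF
# Append rd.iscsi.firmware=1 inside the closing quote of GRUB_CMDLINE_LINUX.
sed -i 's/^\(GRUB_CMDLINE_LINUX=".*\)"/\1 rd.iscsi.firmware=1"/' /tmp/grub.sample
grep GRUB_CMDLINE_LINUX /tmp/grub.sample
```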


    5. leave the chroot, unmount the image and update the grub config

     chroot> exit
     $ guestunmount /tmp/mountpoint
     $ guestfish -a /tmp/CentOS-7-x86_64-GenericCloud.qcow2 -m /dev/sda1 sh "/sbin/grub2-mkconfig -o /boot/grub2/grub.cfg"

    You now have an image that is capable of mounting its root disk over iscsi; load it into glance and create a volume from it:

     $ openstack image create --disk-format qcow2 --container-format bare --file /tmp/CentOS-7-x86_64-GenericCloud.qcow2 centos-bfv
     $ openstack volume create --size 10 --image centos-bfv --bootable centos-test-volume

    Once the cinder volume has finished creating (wait for it to become “available”) you should be able to boot a baremetal instance from the newly created cinder volume:

     $ openstack server create --flavor baremetal --volume centos-test-volume --key default centos-test
     $ nova list 
     $ ssh centos@
    [centos@centos-test ~]$ lsblk
    sda 8:0 0 10G 0 disk 
    └─sda1 8:1 0 10G 0 part /
    vda 253:0 0 80G 0 disk 
    [centos@centos-test ~]$ ls -l /dev/disk/by-path/
    total 0
    lrwxrwxrwx. 1 root root 9 Nov 14 16:59 -> ../../sda
    lrwxrwxrwx. 1 root root 10 Nov 14 16:59 -> ../../sda1
    lrwxrwxrwx. 1 root root 9 Nov 14 16:58 virtio-pci-0000:00:04.0 -> ../../vda

    To see how the cinder volume target information is being passed to the hardware, take a look at the iPXE template for the server in question, e.g.

     $ cat /var/lib/ironic/httpboot/<NODEID>/config
    set username vRefJtDXrEyfDUetpf9S
    set password mD5n2hk4FEvNBGSh
    set initiator-iqn
    sanhook --drive 0x80 || goto fail_iscsi_retry
    sanboot --no-describe || goto fail_iscsi_retry

    [1] – due to a bug in dracut (now fixed upstream [2]), setting this means that the image can’t be used for local boot
    [2] –

    by higginsd at November 22, 2017 12:23 AM

    November 21, 2017

    RDO Blog

    Recent blog posts

    It's been a little while since we've posted a roundup of blogposts around RDO, and you all have been rather prolific in the past month!

    Here's what we as a community have been talking about:

    Hooroo! Australia bids farewell to incredible OpenStack Summit by August Simonelli, Technical Marketing Manager, Cloud

    We have reached the end of another successful and exciting OpenStack Summit. Sydney did not disappoint giving attendees a wonderful show of weather ranging from rain and wind to bright, brilliant sunshine. The running joke was that Sydney was, again, just trying to be like Melbourne. Most locals will get that joke, and hopefully now some of our international visitors do, too!


    Build your Software Defined Data Center with Red Hat CloudForms and Openstack – part 2 by Michele Naldini

    Welcome back, here we will continue with the second part of my post, where we will work with Red Hat Cloudforms. If you remember, in our first post we spoke about Red Hat OpenStack Platform 11 (RHOSP). In addition to the blog article, at the end of this article is also a demo video I created to show to our customers/partners how they can build a fully automated software data center.


    Build your Software Defined Data Center with Red Hat CloudForms and Openstack – part 1 by Michele Naldini

    In this blog, I would like to show you how you can create your fully software-defined data center with two amazing Red Hat products: Red Hat OpenStack Platform and Red Hat CloudForms. Because of the length of this article, I have broken this down into two parts.


    G’Day OpenStack! by August Simonelli, Technical Marketing Manager, Cloud

    In less than one week the OpenStack Summit is coming to Sydney! For those of us in the Australia/New Zealand (ANZ) region this is a very exciting time as we get to showcase our local OpenStack talents and successes. This summit will feature Australia’s largest banks, telcos, and enterprises and show the world how they have adopted, adapted, and succeeded with Open Source software and OpenStack.


    Restarting your TripleO hypervisor will break cinder volume service thus the overcloud pingtest by Carlos Camacho

    I don’t usually restart my hypervisor, but today I had to install LVM2 and virsh stopped working, so a restart was required; once the VMs were up and running, the overcloud pingtest failed as cinder was not able to start.


    CERN CentOS Dojo, Part 4 of 4, Geneva by rbowen

    On Friday evening, I went downtown Geneva with several of my colleagues and various people that had attended the event.


    CERN CentOS Dojo, part 3 of 4: Friday Dojo by rbowen

    On Friday, I attended the CentOS Dojo at CERN, in Meyrin Switzerland.


    CERN Centos Dojo, event report: 2 of 4 – CERN tours by rbowen

    The second half of Thursday was where we got to geek out and tour various parts of CERN.


    CERN Centos Dojo 2017, Event Report (1 of 4): Thursday Meeting by rbowen

    On Thursday, prior to the main event, a smaller group of CentOS core community got together for some deep-dive discussions around the coming challenges that the project is facing, and constructive ways to address them.


    CERN Centos Dojo 2017, Event report (0 of 4) by rbowen

    For the last few days I’ve been in Geneva for the CentOS dojo at CERN.


    Using Ansible Openstack modules on CentOS 7 by Fabian Arrotin

    Suppose that you have a RDO/Openstack cloud already in place, but that you'd want to automate some operations : what can you do ? On my side, I already mentioned that I used puppet to deploy initial clouds, but I still prefer Ansible myself when having to launch ad-hoc tasks, or even change configuration[s]. It's particulary true for our CI environment where we run "agentless" so all configuration changes happen through Ansible.


    Using Falcon to cleanup Satellite host records that belong to terminated OSP instances by Simeon Debreceni

    In an environment where OpenStack instances are automatically subscribed to Satellite, it is important that Satellite is notified of terminated instances so that it can safely delete its host record. Not doing so will:


    My interview with Cool Python Codes by Julien Danjou

    A few days ago, I was contacted by Godson Rapture from Cool Python codes to answer a few questions about what I work on in open source. Godson regularly interviews developers and I invite you to check out his website!


    Using Red Hat OpenStack Platform director to deploy co-located Ceph storage – Part Two by Dan Macpherson, Principal Technical Writer

    Previously we learned all about the benefits in placing Ceph storage services directly on compute nodes in a co-located fashion. This time, we dive deep into the deployment templates to see how an actual deployment comes together and then test the results!


    Using Red Hat OpenStack Platform director to deploy co-located Ceph storage – Part One by Dan Macpherson, Principal Technical Writer

    An exciting new feature in Red Hat OpenStack Platform 11 is full Red Hat OpenStack Platform director support for deploying Red Hat Ceph storage directly on your overcloud compute nodes. Often called hyperconverged, or HCI (for Hyperconverged Infrastructure), this deployment model places the Red Hat Ceph Storage Object Storage Daemons (OSDs) and storage pools directly on the compute nodes.


    by Rich Bowen at November 21, 2017 02:55 PM

    November 16, 2017

    Red Hat Stack

    Hooroo! Australia bids farewell to incredible OpenStack Summit

    We have reached the end of another successful and exciting OpenStack Summit. Sydney did not disappoint giving attendees a wonderful show of weather ranging from rain and wind to bright, brilliant sunshine. The running joke was that Sydney was, again, just trying to be like Melbourne. Most locals will get that joke, and hopefully now some of our international visitors do, too!

    keynote-asMonty Taylor (Red Hat), Mark Collier (OpenStack Foundation), and Lauren Sell (OpenStack Foundation) open the Sydney Summit. (Photo: Author)

    And much like the varied weather, the Summit really reflected the incredible diversity of both technology and community that we in the OpenStack world are so incredibly proud of. With over 2300 attendees from 54 countries, this Summit was noticeably more intimate but no less dynamic. Often having a smaller group of people allows for a more personal experience and increases the opportunities for deep, important interactions.

    To my enjoyment I found that, unlike previous Summits, there wasn’t as much of a singularly dominant technological theme. In Boston it was impossible to turn a corner and not bump into a container talk. While containers were still a strong theme here in Sydney, I felt the general impetus moved away from specific technologies and into use cases and solutions. It feels like the OpenStack community has now matured to the point that it’s able to focus less on each specific technology piece and more on the business value those pieces create when working together.

    openkeynoteJonathan Bryce (OpenStack Foundation) (Photo: Author)

    It is exciting to see both Red Hat associates and customers following this solution-based thinking with sessions demonstrating the business value that our amazing technology creates. Consider such sessions as SD-WAN – The open source way, where the complex components required for a solution are reviewed, and then live demoed as a complete solution. Truly exceptional. Or perhaps check out an overview of how the many components to an NFV solution come together to form a successful business story in A Telco Story of OpenStack Success.

    At this Summit I felt that while the sessions still contained the expected technical content they rarely lost sight of the end goal: that OpenStack is becoming a key, and necessary, component to enabling true enterprise business value from IT systems.

    To this end I was also excited to see over 40 sessions from Red Hat associates and our customers covering a wide range of industry solutions and use cases. From telcos to insurance companies, it is really exciting to see both our associates and our customers, especially those in Australia and New Zealand, sharing their experiences with our solutions with the world.

    paddyMark McLoughlin, Senior Director of Engineering at Red Hat with Paddy Power Betfair’s Steven Armstrong and Thomas Andrew getting ready for a Facebook Live session (Photo: Anna Nathan)

    Of course, there were too many sessions to attend in person, and with the wonderfully dynamic and festive air of the Marketplace offering great demos, swag, food, and, most importantly, conversations, I’m grateful for the OpenStack Foundation’s rapid publishing of all session videos. It’s a veritable pirate’s bounty of goodies and I recommend checking it out sooner rather than later on their website.

    I was able to attend a few talks from Red Hat customers and associates that really got me thinking and excited. The themes were varied, from the growing world of Edge computing, to virtualizing network operations, to changing company culture; Red Hat and our customers are doing very exciting things.

    Digital Transformation

    Take for instance Telstra, who are using Red Hat OpenStack Platform as part of a virtual router solution. Two years ago the journey started with a virtualized network component delivered as an internal trial. This took a year to complete and was a big success from both a technological and cultural standpoint. As Senior Technology Specialist Andrew Harris from Telstra pointed out during the Q and A of his session, projects like this are not only about implementing new technology but also about “educating … staff in Linux, OpenStack and IT systems.” It was a great session co-presented with Juniper and Red Hat and really gets into how Telstra are able to deliver key business requirements such as reliability, redundancy, and scale while still meeting strict cost requirements.


    Of course this type of digital transformation story is not limited to telcos. The use of OpenStack as a catalyst for company change as well as advanced solutions was seen strongly in two sessions from Australia’s Insurance Australia Group (IAG).

    IAGEddie Satterly, IAG (Photo: Author)

    Product Engineering and DataOps Lead Eddie Satterly recounted the journey IAG took to consolidate data for a better customer experience using open source technologies. IAG uses Red Hat OpenStack Platform as the basis for an internal open source revolution that has not only led to significant cost savings but has even resulted in the IAG team open sourcing some of the tools that made it happen. Check out the full story of how they did it and join TechCrunch reporter Frederic Lardinois who chats with Eddie about the entire experience. There’s also a Facebook live chat Eddie did with Mark McLoughlin, Senior Director of Engineering at Red Hat, that further tells their story.




    An area of excitement for those of us with roots in the operational space is the way that OpenStack continues to become easier to install and maintain. The evolution of TripleO, the upstream project for Red Hat OpenStack Platform’s deployment and lifecycle management tool known as director, has really reached a high point in the Pike cycle. With Pike, TripleO has begun utilizing Ansible as the core “engine” for upgrades, container orchestration, and lifecycle management. Check out Senior Principal Software Engineer Steve Hardy’s deep dive into all the cool things TripleO is doing and learn just how excited the new “openstack overcloud config download” command is going to make you, and your Ops team, become.

    jarda2Steve Hardy (Red Hat) and Jaromir Coufal (Red Hat) (Photo: Author)

    And as a quick companion to Steve’s talk, don’t miss his joint lightning talk with Red Hat Senior Product Manager Jaromir Coufal, lovingly titled OpenStack in Containers: A Deployment Hero’s Story of Love and Hate, for an excellent 10 minute intro to the journey of OpenStack, containers, and deployment.

    Want more? Don’t miss these sessions …

    Storage and OpenStack:

    Containers and OpenStack:

    Telcos and OpenStack

    A great event

    Although only 3 days long, this Summit really did pack a sizeable amount of content into that time. Being able to have the OpenStack world come to Sydney and enjoy a bit of Australian culture was really wonderful. Whether we were watching the world famous Melbourne Cup horse race with a room full of OpenStack developers and operators, or cruising Sydney’s famous harbour and talking the merits of cloud storage with the community, it really was a unique and exceptional week.

    melcup2The Melbourne Cup is about to start! (Photo: Author)

    The chance to see colleagues from across the globe, immersed in the technical content and environment they love, supporting and learning alongside customers, vendors, and engineers is incredibly exhilarating. In fact, despite the tiredness at the end of each day, I went to bed each night feeling more and more excited about the next day, week, and year in this wonderful community we call OpenStack!

    See you in Vancouver!

    image1Photo: Darin Sorrentino

    by August Simonelli, Technical Marketing Manager, Cloud at November 16, 2017 10:56 PM

    November 02, 2017

    Red Hat Developer Blog

    Build your Software Defined Data Center with Red Hat CloudForms and Openstack – part 2

    Welcome back, here we will continue with the second part of my post, where we will work with Red Hat Cloudforms. If you remember, in our first post we spoke about Red Hat OpenStack Platform 11 (RHOSP). In addition to the blog article, at the end of this article is also a demo video I created to show to our customers/partners how they can build a fully automated software data center.

    Hands-on – Part 2

    Well, now we need something that can grant us a single pane of glass to manage our environment.

    Something that will manage our hypervisors, private/public clouds, SDN, and PaaS, granting self-service capabilities, compliance and governance, predicting bottlenecks, providing forecast accuracy and chargeback/showback, and deeply analyzing and securing our environment.

    Here the product name is Red Hat CloudForms.

    Picture 1

    I invite you to read our blog and our official documentation to fully understand how amazing CloudForms is!

    Now let’s start with some geek stuff here.

    We would like to grant to our end users the availability of the same heat stack but in a self-service fashion.

    CloudForms, through the self-service UI, is able to show our users different types of service items (VM provisioning on different providers, heat stack execution, Ansible Playbook as a Service, etc.), combining them in a Service Catalog or a Service Catalog bundle.

    In some cases, you may want to present a Self Service Dialog composed of a simple text box, a checkbox, a drop-down list, or more or less whatever you want, to grant your users a simple UI to order their services with a few clicks!

    Let me show you in practice what I mean.

    You need to download the Red Hat CloudForms appliance (qcow2 format) from the Red Hat customer portal and then import it on KVM.

    Remember to set up CloudForms using the appliance_console textual menu and to add a dedicated disk for the VMDB (postgres), as pointed out here. [1]

    Please be aware that CloudForms is fully supported on RHV, OpenStack, VMware vSphere, Azure, Amazon EC2… but not on RHEL + KVM, so DON’T use this configuration for a production environment.

    A full list of platforms able to host CloudForms appliance is available here. [2]

    Let’s start importing our heat stack inside CloudForms from the administrative interface.

    From Services -> Orchestration Templates -> Configuration -> Create New Orchestration Template you will be able to create your stack.

    Picture 2

    A Name, Description, and our stack content are required; then you can click on Add.

    Picture 3

    Now we have to create our Service Dialog to manage input parameters of our stack.

    From Configuration -> Create Service Dialog from Orchestration Template, name your dialog, e.g. stack-3tier-heat-service-dialog.

    Picture 4

    Now, let’s go to Automation -> Automate -> Customization to verify if the service dialog was correctly created.

    Picture 5

    Click on Configuration -> Edit.

    We would like to hide some input parameters because usually your customers/end users are not aware of the architectural details (for instance the stack name, the tenant_id, or the management/Web Provider network id).

    So let’s edit at least these values by unselecting the checkboxes Visible and Required and putting in a Default Value.

    Below is an example of the stack name that will be called “demo-3tier-stack” and will not be shown to the end user.

    Picture 6

    Repeat the same configuration at least for Stack Prefix, Management Network, and Web Provider Network.

    Please be aware that Management Network and Web Provider Network will be attached to our OpenStack External Network so here we need to put the correct network ID.

    In our case, from our all-in-one rhosp, we can get this value with this command:

    [root@osp ~(keystone_admin)]# openstack network list -f value -c ID -c Name | grep FloatingIpWeb

    a18e0aa1-88ab-44d3-b751-ec3dfa703060 FloatingIpWeb

    Picture 7

    After doing our modification, we’ll see a preview of our Service Dialog.

    Picture 8

    Cool! Now that we have created our orchestration template and a service dialog let’s create our service catalog going to Services -> Catalog -> Catalogs ->All Catalogs.

    Now click on Configuration -> Add a New Catalog.

    Picture 9

    Picture 10

    We have to add the last thing, the service catalog item to be created under our “DEMO” service catalog.

    Go to Services -> Catalogs -> Catalog Items and select our “DEMO” Catalog.

    Picture 11

    Select the Orchestration as Catalog Item Type and fill the required fields (Display in Catalog is very important).

    Picture 12

    If you want to restrict the visibility of the service catalog item you can select a tag to be assigned from Policy -> Edit Tags.

    In this case, I’ve previously created a user (developer) member of a group (Demo-developers-grp) with a custom role called “Demo-Developers”.

    Picture 13

    Picture 14

    I have granted the custom role, Demo-Developers only access to the Services Feature so our developer users will be able to order, see, and manage services items from the self-service catalog.

    In addition, I have extended the rbac capabilities of our group assigning a custom tag called “Access” to the user group (Picture 13) and to the service item (Picture 8).

    This mapping permits users who are members of the Demo-Developers group to request and manage service items tagged with the same value (note the My Company tags value in the previous images).

    Now we can order our service catalog item, so let’s switch to the Self Service User Interface (SSUI) by pointing the browser at https://[IPADDRESS]/self_service; logging in as developer, we’ll see our 3-tier-stack service item.

    Picture 15

    Click on the service, then select the CloudForms tenant (tenant mapping in this setup is enabled, so CloudForms tenants map exactly to our OpenStack tenants). Change the input parameters if you want, or leave them at their defaults.

    Picture 16

    Let’s proceed to order our service item by clicking on Add to Shopping Cart and then on Order.

    Picture 17

    I have edited the standard Service Provision method to request approval when the request comes from the Developer group, so as admin, from the web admin interface, approve the request from Services -> Requests.

    Picture 18

    After the approval, we can switch back to the Self Service UI where we’ll find under My Services what we have ordered and in a few minutes all the cool stuff created inside OpenStack.

    Picture 19

    Picture 20

    Picture 21

    Demo Video


    This is just an example of how you can create a functional, fully automated, scalable Software Defined Data Center with the help of Red Hat products and services.

    We are more than happy to help you and your organization to reach your business and technical objectives.

    This blog highlights part of the job we did for an important transformation project for a big financial customer.

    A big THANKS goes to my colleagues Francesco Vollero, Matteo Piccinini, Christian Jung, Nick Catling, Alessandro Arrichiello and Luca Miccini for their support during this project.






    The post Build your Software Defined Data Center with Red Hat CloudForms and Openstack – part 2 appeared first on RHD Blog.

    by Michele Naldini at November 02, 2017 02:00 PM

    Build your Software Defined Data Center with Red Hat CloudForms and Openstack – part 1

    In this blog, I would like to show you how you can create your fully software-defined data center with two amazing Red Hat products: Red Hat OpenStack Platform and Red Hat CloudForms. Because of the length of this article, I have broken this down into two parts.

    As you probably know, every organization needs to evolve into a tech company, leveraging its own digital transformation, embracing or enhancing existing processes, evolving people’s mindsets and soft/hard skills, and of course using new technologies.

    Remember we are living in a digital era where if you don’t change yourself and your organization someone will disrupt your business!

    So, how can I become disruptive in my business?

    Well, speaking from a purely technical perspective a good approach should consider cloud technologies.

    These kinds of technologies can be the first brick of your digital transformation strategy because they can grant business and technologies values.

    For instance, you could:

    • Build new services on demand in a fast way
    • Do it in a self-service fashion
    • Reduce manual activities
    • Scale services and workloads
    • Respond to request peaks
    • Guarantee service quality and SLAs to your customers
    • Respond to customer demands in a short time
    • Reduce time to market
    • Improve TCO

    In my experience as a Solution Architect, I saw many different kinds of approaches when you want to build your Cloud.

    For sure, you need to identify your business needs, then evaluate your use cases, build your user stories and then start thinking about tech stuff and their impacts.

    Remember, don’t start approaching this kind of project from a technical perspective. There is nothing worse. My suggestion is, start thinking about values you want to deliver to your customers/end users.

    Usually, the first technical questions coming to customers’ minds are:

     What kind of services would I like to offer?

    • What use cases do I need to manage?
    • Do I need a public cloud, a private cloud, or a mix of them (hybrid)?
    • Do I want an instance-based environment (IaaS), a container-based one (PaaS for friends), or a SaaS?
    • How complex is it to build and manage a cloud environment?
    • What products will help my organization to reach these objectives?

    The good news is that Red Hat can help you with amazing solutions, people, and processes.

    Hands-on – Part 1

    Now let’s start thinking about a cloud environment based on an instance/VM concept.

    Here, we are speaking about Red Hat OpenStack Platform 11 (RHOSP).

    Now I don’t want to explain in detail all the modules available in RHOSP 11, but just to give you a brief overview, you can see them in the next figure.

    Picture 1

    In order to build an SDDC in a fast, fully automated way, in this blog you’ll see how you can reach that goal using almost all these modules, with a focus on the orchestration engine HEAT.

    HEAT is a very powerful orchestration engine that is able to build anything from a single instance to a very complex architecture composed of instances, networks, routers, load balancers, security groups, storage volumes, floating IPs, alarms, and more or less all objects managed by OpenStack.

    The only thing you have to do is write a Heat Orchestration Template (HOT) in yaml [1] and ask HEAT to execute it for you.

    HEAT will be our driver to build our software-defined data center, managing all the needed OpenStack components to fulfill all your requests.
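    To give a flavor of what a HOT template looks like before we get to the real stack, here is a minimal, self-contained sketch; the image, flavor, and network names are placeholders of mine, not values from the repo:

```yaml
heat_template_version: 2016-10-14

description: Minimal example - one server on an existing network

parameters:
  image:
    type: string
    default: rhel-7.4          # placeholder image name
  network:
    type: string
    default: private           # placeholder network name

resources:
  demo_server:
    type: OS::Nova::Server
    properties:
      image: { get_param: image }
      flavor: m1.small         # placeholder flavor
      networks:
        - network: { get_param: network }

outputs:
  server_ip:
    description: First IP of the server
    value: { get_attr: [demo_server, first_address] }
```

    The real three-tier template follows the same parameters/resources/outputs skeleton, just with many more resource types.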

    So, the first thing is to build a HOT template right? Well, let’s start cloning my git repo:

    This heat stack will create a three-tier application with 2 web servers, 2 app servers, 1 db server, some dedicated private network segments, virtual routers to interconnect the private segments to floating IPs and to segregate networks, two lbaas v2 load balancers (one for the FE and one for the APP layer), auto scaling groups, cinder volumes (boot-from-volume), ad hoc security groups, Aodh alarms with scale-up/scale-down policies, and so on and so forth.

    What? Yes, all of this 🙂

    In this example, the web servers will run httpd 2.4, the app servers will just run a simple python http server on port 8080, and the db server is, right now, a placeholder.

    In the real world, of course, you will take care of installing and configuring your application servers and db servers in an automatic, reproducible, and idempotent way with heat or for instance with an ansible playbook.

    In this repo you’ll find:

    • stack-3tier.yaml
      • The main HOT template, which defines the skeleton of our stack (input parameters, resource types, and outputs).
    • lb-resource-stack-3tier.yaml
      • HOT template to configure our lbaas v2 (load balancer as a service: namespace-based HAProxy). This file will be retrieved by the main HOT template via http.
    • Bash script to perform:
      • OpenStack project creation
      • User management
      • Heat stack creation under a pre-configured OpenStack tenant
      • Deletion of the previous points (in case you need it)

    As prerequisites to build your environment, you need to prepare:

    A laptop or Intel NUC with RHEL 7.X + kvm able to host two virtual machines (1 all-in-one OpenStack vm and 1 CloudForms vm).

    I suggest using a server with 32 GB of ram and at least 250 GB of SSD disk.

    • OpenStack 11 all-in-one VM on rhel 7.4  (installed with packstack usable ONLY for test/demo purpose) with:
      • 20 GB of ram
      • 4-8 vcpu ⇒ better 8 🙂
      • 150 GB of disk (to store cinder-volumes as a file)
      • 1 vnic (nat is ok)
      • Pre-configured external shared network available for all projects

    openstack network create --share --external --provider-network-type flat \
      --provider-physical-network extent FloatingIpWeb

    openstack subnet create --subnet-range --allocation-pool start=,end= \
      --no-dhcp --gateway --network FloatingIpWeb --dns-nameserver \
      FloatingIpWebSubnet

      • Rhel 7.4 images loaded on glance and available for your projects/tenants (public).
      • Apache2 image loaded on glance (based on rhel 7.4 + httpd installed and enabled). You’ll have to define a virtual host pointing to your web server Document Root.
      • A dedicated flavor called “x1.xsmall” with 2 GB of ram, 1 vcpu and 10 GB of disk.
    • A new Apache virtual host on the rhosp vm to host our load balancer yaml file.

    [root@osp conf.d(keystone_admin)]# cat /etc/httpd/conf/ports.conf | grep 8888

    Listen 8888

    [root@osp conf.d(keystone_admin)]# cat /etc/httpd/conf.d/heatstack.conf

    <VirtualHost *:8888>

       ServerName osp.test.local


       DocumentRoot /var/www/html/heat-templates/

       ErrorLog /var/log/httpd/heatstack_error.log

       CustomLog /var/log/httpd/heatstack_requests.log combined

    </VirtualHost>

    • Iptables firewall rule to permit network traffic to tcp port 8888.

    iptables -I INPUT 13 -p tcp -m multiport --dports 8888 -m comment --comment "heat stack 8888 port retrieval" -j ACCEPT

    That’s all.
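    Several of the glance and flavor prerequisites above map to CLI one-liners. A dry-run sketch (the qcow2 path and image name are placeholders I chose, not values from this post; remove the `echo` to execute for real):

```shell
# Dry run: print the prerequisite commands instead of running them.
echo "openstack image create --disk-format qcow2 --container-format bare" \
     "--public --file /tmp/rhel-guest-image-7.4.qcow2 rhel7.4"
echo "openstack flavor create --ram 2048 --vcpus 1 --disk 10 x1.xsmall"
```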

    Let’s clone the git repo, doing some modifications and start.

    • Modify stack-3tier.yaml
      • Uncomment tenant_id rows

    If you are executing the heat stack through the bash script, don’t worry: the script will take care of updating the tenant_id parameter.

    Otherwise, if you are executing it through CloudForms or manually via heat, please update the tenant_id according to your environment configuration.

      • Modify management_network and web_provider_network to point to your floating IP networks. If you are using just a single external network, you can put the same value in both. In a real production environment, you’ll probably use more than one external network with different floating IP pools.
      • Modify the str_replace of web_asg (the autoscaling group for our web servers) according to what you want to show on your web landing page.

    We’ll see later why I’ve made some small modifications using str_replace.  🙂

    Let’s source our keystonerc_admin (so as admin) and run our bash script on our rhosp server:

    [root@osp ~]# source /root/keystonerc_admin

    [root@osp heat-templates(keystonerc_admin)]# bash -x create

    After the automatic creation of the tenant (demo-tenant), you’ll see in the output that heat is creating our resources.

    2017-10-05 10:06:14Z [demo-tenant]: CREATE_IN_PROGRESS  Stack CREATE started

    2017-10-05 10:06:15Z [demo-tenant.web_network]: CREATE_IN_PROGRESS  state changed

    2017-10-05 10:06:16Z [demo-tenant.web_network]: CREATE_COMPLETE  state changed

    2017-10-05 10:06:16Z [demo-tenant.boot_volume_db]: CREATE_IN_PROGRESS  state changed

    2017-10-05 10:06:17Z [demo-tenant.web_subnet]: CREATE_IN_PROGRESS  state changed

    2017-10-05 10:06:18Z [demo-tenant.web_subnet]: CREATE_COMPLETE  state changed

    2017-10-05 10:06:18Z [demo-tenant.web-to-provider-router]: CREATE_IN_PROGRESS  state changed

    2017-10-05 10:06:19Z [demo-tenant.internal_management_network]: CREATE_IN_PROGRESS  state changed

    2017-10-05 10:06:19Z [demo-tenant.internal_management_network]: CREATE_COMPLETE  state changed

    2017-10-05 10:06:20Z [demo-tenant.web_sg]: CREATE_IN_PROGRESS  state changed

    2017-10-05 10:06:20Z [demo-tenant.web-to-provider-router]: CREATE_COMPLETE  state changed

    2017-10-05 10:06:20Z [demo-tenant.web_sg]: CREATE_COMPLETE  state changed

    2017-10-05 10:06:21Z [demo-tenant.management_subnet]: CREATE_IN_PROGRESS  state changed

    … Output truncated
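If the stack ever hangs, filtering the event stream for anything failed or still in progress is a quick way to spot the culprit. A minimal sketch, using a heredoc in place of the real `openstack stack event list` output (the stack name demo-3tier-stack is illustrative):

```shell
# Show only events that are failed or still in progress. In practice, replace
# the heredoc with: openstack stack event list demo-3tier-stack
problems=$(grep -E 'CREATE_(FAILED|IN_PROGRESS)' <<'EOF'
2017-10-05 10:06:14Z [demo-tenant]: CREATE_IN_PROGRESS  Stack CREATE started
2017-10-05 10:06:16Z [demo-tenant.web_network]: CREATE_COMPLETE  state changed
2017-10-05 10:06:20Z [demo-tenant.web_sg]: CREATE_COMPLETE  state changed
EOF
)
echo "$problems"
```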

    You can also log in to the Horizon dashboard and check the status of the heat stack under Orchestration -> Stacks.

    Picture 2

    Clicking on our stack name, you will also be able to see all the resources managed by heat and their current status (Picture 2).

    Picture 3

    After 10-12 minutes, your heat stack will be complete. In a production environment, you’ll reach the same goal in 2 minutes! Yes, 2 minutes to have a fully automated software-defined data center!
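The same wait can be scripted instead of watching Horizon. A minimal sketch; the `openstack` call is stubbed below so it runs standalone (remove the stub to use the real client; the stack name demo-3tier-stack is illustrative):

```shell
# Stub standing in for the real client call:
#   openstack stack show demo-3tier-stack -f value -c stack_status
openstack() { echo "CREATE_COMPLETE"; }

# Poll until the stack leaves CREATE_IN_PROGRESS
status=CREATE_IN_PROGRESS
while [ "$status" = "CREATE_IN_PROGRESS" ]; do
  status=$(openstack stack show demo-3tier-stack -f value -c stack_status)
  sleep 1
done
echo "Final status: $status"
```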

    Let’s check what our stack has deployed by going to the Network Topology tab.

    Cool! Everything was deployed as expected.

    Picture 4

    Now you are probably wondering what kind of services are managed by this environment.

    Let’s check whether the LBaaS load balancers exposing our services are up and running:

    [root@osp ~(keystone_admin)]# neutron lbaas-loadbalancer-list  -c name -c vip_address  -c provisioning_status

    Neutron CLI is deprecated and will be removed in the future. Use OpenStack CLI instead.


    +------------------------------------------------+-------------+---------------------+
    | name                                           | vip_address | provisioning_status |
    +------------------------------------------------+-------------+---------------------+
    | demo-3tier-stack-app_loadbalancer-kjkdiehsldkr |             | ACTIVE              |
    | demo-3tier-stack-web_loadbalancer-ht5wirjwcof3 |             | ACTIVE              |
    +------------------------------------------------+-------------+---------------------+


    Now get the floating IP address associated with our web LBaaS.

    [root@osp ~(keystone_admin)]# openstack floating ip list -f value -c "Floating IP Address" -c "Fixed IP Address" | grep

    So, our external floating IP is

    Let’s see what it is exposing 🙂

    Wow, the Red Hat public website, hosted on our OpenStack!

    Picture 5

    I have cloned the Red Hat company site as a static website, so our app and db servers are not really exposing a service; but of course, you can extend this stack by installing on your app/db servers whatever is needed to expose your services.

    Now, I want to show you that this website is really running on our instances, so let’s scroll the page down to the footer.

    Picture 6

    Here you can see which instance is serving you the website:

    I am web-eg4q.novalocal created on Thu Oct 5 12:11:33 EDT 2017

    Refreshing the page, our LBaaS round-robins to another web instance:

    I am web-9hhm.novalocal created on Thu Oct 5 12:12:51 EDT 2017

    This is why I suggested modifying str_replace according to what you want to change on your web page.

    In this case, I’ve changed the footer to clearly show which server is answering our HTTP requests.
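To watch the round robin without a browser, you can loop curl against the web VIP. A minimal sketch; curl is stubbed below so it runs standalone, and the VIP and hostnames are illustrative:

```shell
VIP=203.0.113.10   # illustrative: your web load balancer's floating IP

# Stub alternating between two web instances; remove it to use the real curl
curl() { echo "I am web-$(( (i % 2) + 1 )).novalocal"; }

# Repeated requests should alternate between the instances behind the LBaaS
for i in 1 2 3 4; do
  curl -s "http://$VIP/"
done
```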




    The post Build your Software Defined Data Center with Red Hat CloudForms and Openstack – part 1 appeared first on RHD Blog.

    by Michele Naldini at November 02, 2017 11:00 AM