Planet RDO

May 22, 2018

Red Hat Stack

“Ultimate Private Cloud” Demo, Under The Hood!

At the recent Red Hat Summit in San Francisco, and more recently the OpenStack Summit in Vancouver, the OpenStack engineering team worked on some interesting demos for the keynote talks.

I’ve been directly involved with the deployment of Red Hat OpenShift Container Platform on bare metal using the Red Hat OpenStack Platform director deployment/management tool, integrated with openshift-ansible. I’ll give some details of this demo, the upstream TripleO features related to this work, and some insight into the potential use cases.

TripleO & Ansible, a Powerful Combination!

If you’ve used Red Hat OpenStack Platform director (or the upstream TripleO project, upon which it is based), you’re familiar with the model of deploying a management node (“undercloud” in TripleO terminology), then deploying and managing your OpenStack nodes on bare metal. However, TripleO also provides a very flexible and powerful combination of planning, deployment, and day-2 operations features. For instance, director allows us to manage and provision bare metal nodes, then deploy virtually any application onto those nodes via Ansible!

The “undercloud” management node makes use of several existing OpenStack services, including Ironic for discovery/introspection and provisioning of bare metal nodes, Heat, a declarative orchestration tool, and Mistral, a workflow engine. It also provides a convenient UI, showcased in the demo, along with flexible CLIs and standard OpenStack REST APIs for automation.

As described in the demo, director has many useful features for managing your hardware inventory – you can either register or auto-discover your nodes, then do introspection (with optional benchmarking tests) to discover the hardware characteristics via the OpenStack ironic-inspector service.  Nodes can then be matched to a particular profile either manually or via rules implemented through the OpenStack Mistral workflow API. You are then ready to deploy an Operating System image onto the nodes using the OpenStack Ironic “bare metal-as-a-service” API.
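To make the rule-based profile matching a little more concrete, here is a minimal Python sketch. It is only an illustration of the idea: the `Node` fields, the thresholds, and the profile names are all assumptions made up for this example, and it does not use the actual ironic-inspector or Mistral rule format.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """Hardware facts as introspection might report them (field names are illustrative)."""
    name: str
    cpus: int
    memory_mb: int
    local_gb: int


# Ordered (profile, predicate) pairs: the first rule that accepts a node wins.
# The thresholds are made-up examples, not director defaults.
RULES: list[tuple[str, Callable[[Node], bool]]] = [
    ("compute", lambda n: n.cpus >= 16 and n.memory_mb >= 65536),
    ("control", lambda n: n.cpus >= 8 and n.memory_mb >= 32768),
    ("openshift-worker", lambda n: n.local_gb >= 100),
]


def match_profile(node: Node) -> Optional[str]:
    """Return the first profile whose rule accepts the node, or None if none does."""
    for profile, accepts in RULES:
        if accepts(node):
            return profile
    return None


# Hypothetical inventory, standing in for introspection results.
nodes = [
    Node("node-0", cpus=32, memory_mb=131072, local_gb=500),
    Node("node-1", cpus=8, memory_mb=32768, local_gb=200),
    Node("node-2", cpus=4, memory_mb=8192, local_gb=40),
]
assignments = {n.name: match_profile(n) for n in nodes}
```

A node that matches no rule is left unassigned (`None`), which corresponds to the case where an operator would tag it manually.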

When deciding what will be deployed onto your nodes, director has the concept of a “deployment plan,” which combines specifying which nodes/profiles will be used and which configuration will be applied, known as “roles” in TripleO terminology.
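As a rough mental model, a deployment plan can be thought of as a mapping from roles to a profile and a node count. The Python sketch below checks whether an inventory can satisfy such a plan. The dictionary layout, role names, and node names are assumptions for illustration only, not the format TripleO stores internally, and the check assumes each profile is used by a single role.

```python
from collections import Counter


def unmet_roles(plan, node_profiles):
    """Return {role: missing_count} for roles the inventory cannot satisfy."""
    available = Counter(node_profiles.values())
    shortfall = {}
    for role, spec in plan.items():
        missing = spec["count"] - available[spec["profile"]]
        if missing > 0:
            shortfall[role] = missing
    return shortfall


# Hypothetical plan: which profile each role needs, and how many nodes.
plan = {
    "Controller": {"profile": "control", "count": 3},
    "Compute": {"profile": "compute", "count": 2},
    "OpenShiftWorker": {"profile": "openshift-worker", "count": 2},
}

# Hypothetical inventory: node name -> matched profile.
node_profiles = {
    "node-0": "control", "node-1": "control", "node-2": "control",
    "node-3": "compute", "node-4": "compute",
    "node-5": "openshift-worker",
}

shortfall = unmet_roles(plan, node_profiles)
```

Here the plan asks for two OpenShift workers but only one node carries that profile, so the check reports one missing node for that role.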

This is a pretty flexible system enabling a high degree of operator customization and extension through custom roles where needed, as well as supporting network isolation and custom networks (isolated networks for different types of traffic), declarative configuration of  network interfaces, and much more!

Deploying Red Hat OpenShift Container Platform on bare metal

What was new in the Summit demo was deploying OpenShift alongside OpenStack, both on bare metal, and both managed by Red Hat OpenStack Platform director. Over the last few releases we’ve made good progress on Ansible integration in TripleO, including enabling integration with “external” installers. We’ve made use of that capability here to deploy OpenShift via TripleO, combining the powerful bare-metal management capabilities of TripleO with existing openshift-ansible management of configuration.

Integration between Red Hat OpenStack Platform and Red Hat OpenShift Container Platform

Something we didn’t have time to get into in great detail during the demo was the potential for integration between OpenStack and OpenShift – if you have an existing Red Hat OpenStack Platform deployment you can choose to deploy OpenShift with persistent volumes backed by Cinder (the OpenStack block storage service). And for networking integration, the Kuryr project, combined with OVN from Open vSwitch, enables the sharing of a common overlay network between both platforms, without the overhead of double encapsulation.

This makes it easy to add OpenShift managed containers to your infrastructure, while almost seamlessly integrating them with VM workloads running on OpenStack. You can also take advantage of existing OpenStack capacity and vendor support while using the container management capabilities of OpenShift.

Container-native virtualization

After we deployed OpenShift we saw some exciting demos focussed on workloads running on OpenShift, including a preview of the new container native virtualization (CNV) feature. CNV uses the upstream KubeVirt project to run Virtual Machine (VM) workloads directly on OpenShift.

Unlike the OpenShift and OpenStack combination described above, here OpenShift manages the VM workloads, providing an easier way to  transition your VM workloads where no existing virtualization solution is in place. The bare-metal deployment capabilities outlined earlier are particularly relevant here, as you may want to run OpenShift worker nodes that host VMs on bare metal for improved performance. As the demo has shown,  the combination of director and openshift-ansible makes deploying, managing, and running OpenShift and OpenStack easier to achieve!


by Steven Hardy, Senior Principal Software Engineer at May 22, 2018 11:56 PM

May 18, 2018

Carlos Camacho

Testing Undercloud backup and restore using Ansible

Testing the Undercloud backup and restore

It is possible to test how the Undercloud backup and restore should be performed using Ansible.

The following Ansible playbooks show how Ansible can be used to test the backup and restore procedure in a test environment.

Creating the Ansible playbooks to run the tasks

Create a yaml file called uc-backup.yaml with the following content:

- hosts: localhost
  tasks:
  # The delete fails harmlessly when no earlier backup container exists.
  - name: Remove any previously created UC backups
    shell: |
      source ~/stackrc
      openstack container delete undercloud-backups --recursive
    ignore_errors: True
  - name: Create UC backup
    shell: |
      source ~/stackrc
      openstack undercloud backup --add-path /etc/ --add-path /root/

Create a yaml file called uc-backup-download.yaml with the following content:

- hosts: localhost
  tasks:
  - name: Print destroy warning.
    debug:
      msg: |
        We are about to destroy the UC. Because we are not
        moving the backup tarball off the UC, we will
        download it and unzip it in a temporary folder to
        recover the UC using those files.
  - name: Make sure the temp folder used for the restore does not exist
    become: true
    file:
      path: "/var/tmp/test_bk_down"
      state: absent
  - name: Create temp folder to unzip the backup
    become: true
    file:
      path: "/var/tmp/test_bk_down"
      state: directory
      owner: "stack"
      group: "stack"
      mode: "0775"
      recurse: "yes"
  - name: Download the UC backup to a temporary folder (After breaking the UC we won't be able to get it back)
    shell: |
      source ~/stackrc
      cd /var/tmp/test_bk_down
      openstack container save undercloud-backups
  - name: Unzip the backup
    become: true
    shell: |
      cd /var/tmp/test_bk_down
      tar -xvf UC-backup-*.tar
      gunzip *.gz
      tar -xvf filesystem-*.tar
  # The unzip step runs as root, so hand the files back to the stack user.
  - name: Make sure stack user can get the backup files
    become: true
    file:
      path: "/var/tmp/test_bk_down"
      state: directory
      owner: "stack"
      group: "stack"
      mode: "0775"
      recurse: "yes"

Create a yaml file called uc-destroy.yaml with the following content:

- hosts: localhost
  tasks:
  - name: Remove mariadb
    become: true
    yum: pkg={{ item }} state=absent
    with_items:
      - mariadb
      - mariadb-server
  - name: Remove files
    become: true
    file:
      path: "{{ item }}"
      state: absent
    with_items:
      - /root/.my.cnf
      - /var/lib/mysql

Create a yaml file called uc-restore.yaml with the following content:

- hosts: localhost
  tasks:
    - name: Install mariadb
      become: true
      yum: pkg={{ item }} state=present
      with_items:
        - mariadb
        - mariadb-server
    - name: Restart MariaDB
      become: true
      service: name=mariadb state=restarted
    - name: Restore the backup DB
      shell: cat /var/tmp/test_bk_down/all-databases-*.sql | sudo mysql
    - name: Restart MariaDB to refresh permissions
      become: true
      service: name=mariadb state=restarted
    # Read the old root password from the restored .my.cnf.
    - name: Register root password
      become: true
      shell: cat /var/tmp/test_bk_down/root/.my.cnf | grep -m1 password | cut -d'=' -f2 | tr -d "'"
      register: oldpass
    - name: Clean root password from MariaDB to reinstall the UC
      shell: |
        mysqladmin -u root -p{{ oldpass.stdout }} password ''
    - name: Clean users
      become: true
      mysql_user: name="{{ item }}" host_all="yes" state="absent"
      with_items:
        - ceilometer
        - glance
        - heat
        - ironic
        - keystone
        - neutron
        - nova
        - mistral
        - zaqar
    - name: Reinstall the undercloud
      shell: |
        openstack undercloud install
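
The “Register root password” task relies on a small shell pipeline (grep, cut, tr). For readers who find that pipeline hard to parse, here is a sketch of the same logic in Python. The function name and the sample file content are made up for this example; the playbook itself does not use this code.

```python
def first_password(cnf_text: str) -> str:
    """Mimic: grep -m1 password | cut -d'=' -f2 | tr -d "'"."""
    for line in cnf_text.splitlines():
        if "password" in line:              # grep -m1: first matching line only
            fields = line.split("=")
            # cut -f2 prints the whole line when the delimiter is absent
            value = fields[1] if len(fields) > 1 else line
            return value.replace("'", "")   # tr -d "'": drop single quotes
    return ""


# Hypothetical .my.cnf content, only to show the shape of the input.
sample = "[client]\nuser=root\npassword='s3cret'\n"
extracted = first_password(sample)
```

Like the shell version, this keeps any whitespace around the value, so a line written as `password = 'x'` would yield a leading space.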

Running the Undercloud backup and restore tasks

To test the UC backup and restore procedure, run the following from the UC after creating the Ansible playbooks:

  # This playbook will create the UC backup
  ansible-playbook uc-backup.yaml
  # This playbook will download the UC backup to be used in the restore
  ansible-playbook uc-backup-download.yaml
  # This playbook will destroy the UC (remove DB server, remove DB files, remove config files)
  ansible-playbook uc-destroy.yaml
  # This playbook will reinstall the DB server, restore the DB backup, fix permissions and reinstall the UC
  ansible-playbook uc-restore.yaml
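
Since each playbook depends on the previous one having succeeded, it can be convenient to wrap the four commands so that the sequence stops at the first failure. The following Python sketch shows one way to do that. The helper names are hypothetical, and the `runner` parameter exists only so the logic can be exercised without Ansible installed.

```python
import subprocess

# Same order as the commands above.
PLAYBOOKS = [
    "uc-backup.yaml",
    "uc-backup-download.yaml",
    "uc-destroy.yaml",
    "uc-restore.yaml",
]


def build_commands(playbooks):
    """Turn playbook file names into ansible-playbook argument lists."""
    return [["ansible-playbook", playbook] for playbook in playbooks]


def run_all(commands, runner=subprocess.run):
    """Run each command in order and stop at the first non-zero exit code.

    Returns the list of commands that completed successfully.
    """
    completed = []
    for command in commands:
        result = runner(command)
        if result.returncode != 0:
            break
        completed.append(command)
    return completed
```

On the UC, calling `run_all(build_commands(PLAYBOOKS))` would run the four playbooks in order and skip the destroy and restore steps if the backup or download failed.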

Checking the Undercloud state

After the Undercloud restore playbook finishes, the user should again be able to run any CLI command, for example:

  source ~/stackrc
  openstack stack list

Source code available in GitHub

by Carlos Camacho at May 18, 2018 12:00 AM

May 11, 2018

Groningen Rain

Your LEGO® Order Has Been Shipped

In preparation for the Red Hat Summit this week and OpenStack Summit in a week, I put together a hardware demo to sit in the RDO booth.

I know, I know – the title has LEGO in it and now I’m talking tech.

Bait and switch, AMIRITE?!?

I promise it’s relevant.

So I put together this little hardware demo…

It ended up being two NUCs – one provisioning the other to build an all-in-one cloud using TripleO quickstart.

I didn’t use all of the hardware in the original demo and this is something I’d ultimately like to do after it all ships back to me.

Originally it was a build with one router for the public network, one switch for the private network, four NUCs – one to provision, one undercloud, one overcloud compute node and one overcloud controller and all the necessary networking and power cables.

Then it evolved to include a Pine64 to demo power management, but that doesn’t actually belong to our project, so I need to return it to its owner in June.

Anyway, LEGOs, RIGHT?!?

Long version longer is that I wanted to rebuild this demo AND build a lego NUC rack, too.

I found instructions on the web that looked simple enough and included every single brick needed to build a four NUC rack.

If you scroll down to the Bill of Materials, it’s… detailed.

I ran out of time for these events, but it’s something that’s still on my mind for future events, so this week I started ordering the bricks.

And HOLY FREE HOLY I had to order parts from FOUR DIFFERENT STORES.

Thankfully, I could get most of the bricks from LEGO pick a brick despite it being IMPOSSIBLE to search for specific individual bricks. Then, I got the stackable plates from Strictly Bricks.

Then the continuous arches from an obscure shop in the Netherlands. And the roof bricks which are RETIRED by LEGO from ANOTHER obscure shop in the Netherlands.

And the last two I’m not even linking because I don’t remember how I found them or if I actually remembered to ORDER those parts because it took hours of frustrating, painstaking time to find what I did find and I think by the end I totally forgot more than a few things.

Can you tell I ran into some issues?

That I’m frustrated?

But there’s another side of me that’s completely totally OVER THE MOON cause I get to play with LEGO for my JOB.

And, thankfully, that’s the bigger part.


by rainsdance at May 11, 2018 08:00 AM

May 10, 2018

Groningen Rain

And Then I Realized I Hadn’t Posted Yet

I completely forgot to write until eleven o’clock at night and now it’s past time for bed and I haven’t written.

This is how you get crap.

Or is this Not Crap ™?

I have to share this bit that happened earlier tonight. I was remotely managing the RDO portion of the RDO / ManageIQ / Ceph booth at Red Hat Summit – we were working on the hardware demo. At some point yesterday it borked out and needed to be reinstalled to work.

I was on irc with the OSAS Ambassador and she said, effectively, “I can’t do this install, I don’t have my guru.”

And without hesitation.

I replied.

“I am your guru.”

I can’t believe I wrote that because it’s so goddamn egotistical. And I’m totally wincing at the pompous ego that typed those words.

But simultaneously?



I know this stuff.

And if I run into something that I haven’t run into before, I know how to figure it out.


by rainsdance at May 10, 2018 08:00 AM

May 09, 2018

OpenStack In Production (CERN)

Introducing GPUs to the CERN Cloud

High-energy physics workloads can benefit from massive parallelism -- and as a matter of fact, the domain is seeing increasing adoption of deep learning solutions. Take for example the newly-announced TrackML challenge [7], already running on Kaggle! This context motivates CERN to consider GPU provisioning in our OpenStack cloud, as computation accelerators, promising access to powerful GPU computing resources to developers and batch processing alike.

What are the options?

Given the nature of our workloads, our focus is on discrete PCI-E Nvidia cards, like the GTX1080Ti and the Tesla P100. There are 2 ways to provision these GPUs to VMs: PCI passthrough and virtualized GPU. The first method is not specific to GPUs, but applies to any PCI device. The device is claimed by a generic driver, VFIO, on the hypervisor (which cannot use it anymore) and exclusive access to it is given to a single VM [1]. Essentially, from the host’s perspective the VM becomes a userspace driver [2], while the VM sees the physical PCI device and can use normal drivers, expecting no functionality limitation and no performance overhead.
Visualizing passthrough vs mdev vGPU [9]
That said, perhaps some “limitation in functionality” is warranted, so that the untrusted VM can’t make low-level hardware configuration changes on the passed-through device, like changing power settings or even its firmware! In fact, security-wise PCI passthrough leaves a lot to be desired. Apart from allowing the VM to change the device’s configuration, it might leave a possibility for side-channel attacks on the hypervisor (although we have not observed this, and a hardware “IOMMU” protects against DMA attacks from the passed-through device). Perhaps more importantly, the device’s state won’t be automatically reset after deallocating from a VM. In the case of a GPU, data from a previous use may persist in the device’s global memory when it is allocated to a new VM. The first concern may be mitigated by improving VFIO, while the second, the issue of device reset or “cleanup”, provides a use case for a more general accelerator management framework in OpenStack -- the nascent Cyborg project may fit the bill.
Virtualized GPUs are a vendor-specific option, promising better manageability and alleviating the previous issues. Instead of having to pass through entire physical devices, we can split physical devices into virtual pieces on demand (well, almost on demand; there needs to be no vGPU allocated in order to change the split) and hand out a piece of GPU to any VM. This solution is indeed more elegant. In Intel and Nvidia’s case, virtualization is implemented as a software layer in the hypervisor, which provides “mediated devices” (mdev [3]), virtual slices of GPU that appear like virtual PCI devices to the host and can be given to the VMs individually. This requires a special vendor-specific driver on the hypervisor (Nvidia GRID, Intel GVT-g), unfortunately not yet supporting KVM. AMD is following a different path, implementing SR-IOV at a hardware level.

CERN’s implementation

PCI passthrough has been supported in Nova for several releases, so it was the first solution we tried. There is a guide in the OpenStack docs [4], as well as previous summit talks on the subject [1]. Once everything is configured, the users will see special VM flavors (“g1.large”), whose extra_specs field includes passthrough of a particular kind of gpu. For example, to deploy a GTX 1080Ti, we use the following configuration:
add PciPassthroughFilter to enabled/default filters
flavor extra_specs
--property "pci_passthrough:alias"="nvP1080ti_VGA:1,nvP1080ti_SND:1"
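The value of the `pci_passthrough:alias` property is a comma-separated list of `alias:count` pairs. As a small illustration of how to read it, the Python sketch below splits the value from the flavor above into a dictionary. The function name is made up for this example, and it is not Nova’s own parser.

```python
def parse_alias_spec(spec: str) -> dict[str, int]:
    """Split 'aliasA:1,aliasB:2' into {'aliasA': 1, 'aliasB': 2}."""
    requests = {}
    for item in spec.split(","):
        # Each item is '<alias name>:<number of devices requested>'.
        name, _, count = item.strip().partition(":")
        requests[name] = int(count)
    return requests


requested = parse_alias_spec("nvP1080ti_VGA:1,nvP1080ti_SND:1")
```

For the flavor above this yields one VGA device and one sound device per VM.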
A detail here is that most GPUs appear as 2 pci devices, the VGA and the sound device, both of which must be passed through at the same time (they are in the same IOMMU group; basically an IOMMU group [6] is the smallest passable unit).
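To see which devices share an IOMMU group on a hypervisor, one can walk the `/sys/kernel/iommu_groups` tree that the Linux kernel exposes when the IOMMU is enabled. The Python sketch below is a minimal example of that. The function name is an assumption for illustration, and the `root` parameter is there only so the code can be pointed at a test directory.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # IOMMU disabled or not a Linux host: nothing to report.
        return groups
    for group_dir in base.iterdir():
        devices = group_dir / "devices"
        if group_dir.name.isdigit() and devices.is_dir():
            groups[int(group_dir.name)] = sorted(d.name for d in devices.iterdir())
    return groups
```

On a host with a GPU, the VGA and sound functions would typically show up under the same group number, which is why both have to be passed through together.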
Our cloud was on Ocata at the time, using CellsV1, and there were a few hiccups, such as the Puppet modules not parsing an option syntax correctly (MultiStrOpt) and CellsV1 dropping the PCI requests. For Puppet, we were simply missing some upstream commits [15]. From Pike on and in CellsV2, these issues shouldn’t exist. As soon as we had worked around them and puppetized our hypervisor configuration, we started offering cloud GPUs with PCI passthrough and evaluating the solution. We created a few GPU flavors, following the AWS example of keeping the number of vCPUs the same as the corresponding normal flavors.
From the user’s perspective, there proved to be no functionality issues. CUDA applications, like TensorFlow, run normally; the users are very happy that they finally have exclusive access to their GPUs (there is good tenant isolation). And there is no performance penalty in the VM, as measured by the SHOC benchmark [5] -- admittedly quite old, we preferred this benchmark because it also measures low-level details, apart from just application performance.
From the cloud provider’s perspective, there are a few issues. Apart from the potential security problems identified before, since the hypervisor has no control over the passed-through device, we can’t monitor the GPU. We can’t measure its actual utilization, or get warnings in case of critical events, like overheating.
Normalized performance of VMs vs. hypervisor on some SHOC benchmarks. First 2: low-level gpu features, Last 2: gpu algorithms [8]. There are different test cases of VMs, to check if other parameters play a role. The “Small VM” has 2 vCPUs, “Large VM” has 4, “Pinned VM” has 2 pinned vCPUs (thread siblings), “2 pin diff N” and “2 pin same N” measure performance in 2 pinned VMs running simultaneously, in different vs the same NUMA nodes

Virtualized GPU experiments

The allure of vGPUs amounts largely to finer-grained distribution of resources, fewer security concerns (debatable) and monitoring. Nova support for provisioning vGPUs is offered in Queens as an experimental feature. However, our cloud is running on KVM hypervisors (on CERN CentOS 7.4 [14]), which Nvidia does not support as of May 2018 (Nvidia GRID v6.0). When it does, the hypervisor will be able to split the GPU into vGPUs according to one of many possible profiles, such as into 4 or 16 pieces. Libvirt then assigns these mdevs to VMs in a similar way to hostdevs (passthrough). Details are in the OpenStack docs at [16].
Despite this promise, it remains to be seen if virtual GPUs will turn out to be an attractive offering for us. This depends on vendors’ licensing costs (such as per VM pricing), which, for the compute-compatible offering, can be significant. Added to that is the fact that only a subset of standard CUDA is supported (not supported are the unified memory and “CUDA tools” [11], probably referring to tools like the Nvidia profiler). vGPUs also oversubscribe the GPU’s compute resources, which can be seen in either a positive or negative light. On the one hand, this promises higher resource utilization, especially for bursty workloads, such as development work. On the other hand, we may expect a lower quality of service [12].

And the road goes on...

Our initial cloud GPU offering is very limited, and we intend to gradually increase it. Before that, it will be important to address (or at least be conscious about) the security repercussions of PCI passthrough. But even more significant is to address GPU accounting in a straightforward manner, by enforcing quotas on GPU resources. So far we haven’t tested the case of GPU P2P, with multi-GPU VMs, which is supposed to be problematic [13].
Another direction we’ll be researching is offering GPU-enabled container clusters, backed by pci-passthrough VMs. It may be that, with this approach, we can emulate a behavior similar to vGPUs and circumvent some of the bigger problems with pci passthrough.


[5]: SHOC benchmark suite:
[11]: CUDA Unified memory and tooling not supported on Nvidia vGPU:

by Konstantinos Samaras-Tsakiris at May 09, 2018 05:55 PM

Groningen Rain

My Miles Are About to Skyrocket

I reached out to my colleague that used to do This Job and asked him, “What events do you typically travel to outside of the OpenStack Summits [0] / PTGs [1]?”

And he replied something witty and important and vital and I completely didn’t have logging enabled nor did I write it down because I’m made of awesome.

But I did write down what OpenStack Days [2] are the biggest ones that I should try to attend, if not this year, in the upcoming years.

OpenStack Days Israel [3]
OpenStack Days Benelux [4]
OpenStack Days Nordic [5]
OpenStack Days NYC [6]
OpenStack Days UK [7]

And then there’s FOSDEM [8] and DevConf.CZ [9] and two CentOS Dojos [10] that I’m helping plan.

Oh, RIGHT! Plus RDO Test Days [11] in Brno!

This week, in particular, I have severe FOMO because it’s Red Hat Summit. And then 21-24 May is OpenStack Summit Vancouver. I’m remotely managing both. I’ve done everything I can possibly do to prepare for both, now I can only sit from afar and wait.

And put out fires, as needed.

Which means I could really use some distractions, People. Therefore, I thought it’d be nice to look at all the places IT

This means, over the next year, POSSIBLY, I’ll be travelling to…..


June 2018
14-15 RDO Test Days Rocky M2 Brno Czech Republic

August 2018
2-3 RDO Test Days Rocky M3 Brno Czech Republic
?? OpenStack Days NYC New York USA (tentative!)

September 2018
6-7 RDO Test Days Rocky GA Brno Czech Republic
10-14 OpenStack PTG Denver Colorado USA
13 OpenStack Day Benelux Amsterdam Netherlands (conflicts with PTG – need to send someone else!)
?? OpenStack Days UK London United Kingdom (tentative!)

October 2018
09-10 OpenStack Days Nordic Stockholm Sweden
20 CentOS Dojo @CERN (tentative!)
?? OpenStack Days Israel Tel Aviv (tentative!)

November 2018
13-15 OpenStack Summit Berlin Germany

January 2019
?? Brno Czech Republic (tentative!)

February 2019
?? FOSDEM Brussels Belgium (tentative!)
?? OpenStack PTG APAC (tentative!)

And, boy, are my arms tired ALREADY.

[6] OpenStack Days NYC was called Openstack Days East in 2016, doesn’t appear to have happened last year and I can’t find information about it anywhere.
[7] OpenStack Days UK doesn’t have any information up for this year, but last year was at

by rainsdance at May 09, 2018 08:00 AM

May 08, 2018

Red Hat Stack

A modern hybrid cloud platform for innovation: Containers on Cloud with OpenShift on OpenStack

Market trends show that due to long application life-cycles and the high cost of change, enterprises will be dealing with a mix of bare-metal, virtualized, and containerized applications for many years to come. This is true even as greenfield investment moves to a more container-focused approach.

Red Hat® OpenStack® Platform provides a solution to the problem of managing large scale infrastructure which is not immediately solved by containers or the systems that orchestrate them.

In the OpenStack world, everything can be automated. If you want to provision a VM, a storage volume, a new subnet or a firewall rule, all these tasks can be achieved using an easy-to-use UI or a command line interface, leveraging OpenStack APIs. Previously, all these infrastructure needs might have required a ticket and some internal processing, and could take weeks. Now such provisioning can all be done with a script or a playbook, and can be completely automated.

The applications and workloads can specify cloud resources to be provisioned and spun up from a definition file. This enables new levels of provision-as-you-need-it. As demand increases, the infrastructure resources can be easily scaled! Operational data and meters can trigger and orchestrate new infrastructure provisioning automatically when needed.
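
To give a feel for how a meter can drive provisioning, here is a minimal Python sketch of a proportional scaling rule. It is purely illustrative: the function name, the 60% target, and the instance limits are assumptions for this example, and it does not represent any particular OpenStack API or alarm definition.

```python
def desired_instances(current, cpu_percent, target_percent=60,
                      min_instances=1, max_instances=10):
    """Size the pool so that average CPU utilization lands near the target.

    Uses integer percentages and ceiling division, then clamps the result
    to the allowed range.
    """
    if current < 1:
        return min_instances
    wanted = -(-current * cpu_percent // target_percent)  # ceiling division
    return max(min_instances, min(max_instances, wanted))


scale_out = desired_instances(current=4, cpu_percent=90)   # load is high
scale_in = desired_instances(current=4, cpu_percent=30)    # load is low
```

With four instances at 90% average CPU the rule asks for six, and at 30% it asks for two. An orchestration layer would then create or remove instances to reach that number.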

On the consumption side, it is no longer a developer ssh’ing into a server and manually deploying an application server. Now, they simply run a few OpenShift commands, select from a list of predefined applications, language runtimes, and databases, and then have those resources provisioned on top of the target infrastructure that was automatically provisioned and configured.

Red Hat OpenShift Container Platform gives you the ability to define an application from a single YAML file. This makes it convenient for a developer to share with other developers, allowing them to  launch an exact copy of that application, make code changes, and share it back. This capability is only possible when you have automation at this level.

Infrastructure and application platforms resources are now exposed differently in an easy and consumable way, and the days when you needed to buy a server, manually connect it to the network and install runtimes and applications manually are now very much a thing of the past.

With Red Hat OpenShift Container Platform on Red Hat OpenStack Platform you get:

A WORKLOAD DRIVEN I.T. PLATFORM: The underlying infrastructure doesn’t matter from a developer perspective. Container platforms exist to ensure the apps are the main focus. As a developer I only care about the apps and I want to have a consistent experience, regardless of the underlying infrastructure platform. OpenStack provides this to OpenShift.

DEEP PLATFORM INTEGRATION: Networks (kuryr), services (ironic, barbican, octavia), storage (cinder, ceph), installation (openshift-ansible) are all engineered to work together to provide the tightest integrations across the stack, right down to bare metal. All are based in Linux® and engineered in the open source community for exceptional performance.

PROGRAMMATIC SCALE-OUT: OpenStack is 100% API driven across the infrastructure software stack. Storage, networking, compute VMs or even bare metal all deliver the ability to scale out rapidly and programmatically. With scale under workloads, growth is easy.

ACROSS ANY TYPE OF INFRASTRUCTURE: OpenStack can utilise bare metal for virtualization or for direct consumption. It can interact with network switches and storage directly to ensure hardware is put to work for the workloads it supports.

FULLY MANAGED: Red Hat CloudForms and Red Hat Ansible Automation provide common tooling across multiple providers. Ansible is Red Hat’s automation engine for everything, and it’s present under the hood in Red Hat CloudForms. With Red Hat OpenStack Platform, Red Hat CloudForms is deeply integrated into the overcloud, the undercloud, and the container platform on top. Full stack awareness means total control. And our Red Hat Cloud Suite bundle of products provides access to OpenStack and OpenShift, as well as an array of supporting technologies. Red Hat Satellite, Red Hat Virtualization, Red Hat Insights, and even Red Hat CloudForms are included!

A SOLID FOUNDATION: All Red Hat products are co-engineered with Red Hat Enterprise Linux at their core. Fixes happen fast and accurately as all components of the stack are in unison and developmental harmony. Whether issues might lie at the PaaS, IaaS or underlying Linux layer, Red Hat will support you all the way!

Red Hat Services can help you accelerate your journey to Hybrid Cloud adoption, and realize the most value from best-of-breed open source technology platforms such as OpenShift on top of OpenStack. Want to learn more about how we can help? Feel free to reach out to me directly with any questions, or download our Containers on Cloud datasheet.

by Stephane Lefrere at May 08, 2018 02:50 PM

May 04, 2018

RDO Blog

The RDO Community Represents at Red Hat Summit, May 8-10

Over the past few weeks we’ve been gearing up for Red Hat Summit and now it’s almost here! We hope to see you onsite — there are so many great talks, events, and networking opportunities to take advantage of. From panels to general sessions to hands-on labs, chances are you’re going to have a hard time choosing which sessions to attend!

We’re particularly excited about the below talks, but the full schedule of talks related to RDO, RHOSP, TripleO, and Ceph is over here.

Once you’re sessioned-out, come swing by the RDO booth, shared with ManageIQ and Ceph to see our newly updated hardware demo.

OpenStack use cases: How businesses succeed with OpenStack

Have you ever wondered just how the world’s largest companies are using Red Hat OpenStack Platform?

In this session, we’ll look at some of the most interesting use cases from Red Hat customers around the world, and give you some insight into how they achieved their technical and digital successes. You’ll learn how top organizations have used OpenStack to deliver unprecedented value.

Date: Tuesday, May 8
Time: 3:30 PM – 4:15 PM
Location: Moscone West – 2007

Red Hat OpenStack Platform: The road ahead

OpenStack has reached a maturity level confirmed by wide industry adoption and the amazing number of active production deployments, with Red Hat a lead contributor to the project. In this session, we’ll share where we are investing for upcoming releases.

Date: Tuesday, May 8
Time: 4:30 PM – 5:15 PM
Location: Moscone West – 2007

Production-ready NFV at Telecom Italia (TIM)

Telecom Italia (TIM) is the first large telco in Italy to deploy OpenStack into production. TIM chose to deploy an NFV solution based on Red Hat OpenStack Platform for its cloud datacenter and put critical VNFs—such as vIMS and vEPC—into production for its core business services. In this session, learn how TIM worked with Red Hat Consulting, as well as Ericsson as VNF vendor and Accenture as system integrator, to set up an end-to-end NFV environment that matched its requirements with complex features like DPDK, completely automated.

Date: Wednesday, May 9
Time: 11:45 AM – 12:30 PM
Location: Moscone West – 2002

Scalable application platform on Ceph, OpenStack and Ansible

How do you take a Ceph environment providing Cinder block storage to Openstack from a handful of nodes in a PoC environment all the way up to an 800+ node production environment, while serving live applications? In this session we will have two customers talking about how they did this and lessons learned! At Fidelity, they learned a lot about scaling hardware, tuning Ceph parameters, and handling version upgrades (using Ansible automation!). Solera Holdings Inc committed to modernizing the way Applications are developed and deployed with the need for highly performant, redundant and cost effective Object-Storage grew tremendously. After an successful PoC with Ceph Storage, Red Hat was chosen as a solution partner due to their excellence in customer experience and support as well as expertise in Ansible, as Solera iwill automate networking equipment (Fabric, Firewalls, Load-balancers) .

In addition to committing to reducing expensive enterprise SAN storage Solera also decided to commit to a new Virtualization strategy and building up a new IaaS to tackle challenges such as DBaaS and leveraging its newly built SDS backend for OpenStack while using the new SDS capabilities via iscsi to meet existing storage demands on VMware.

Solera will share why they chose Red Hat as a partner, how it has impacted and benefited developers and DevOps engineers alike, and where the road will be taking us. Come to this session to hear how both Fidelity and Solera Holdings Inc did it and what benefits they gained along the way!

Date:Thursday, May 10
Time:1:00 PM – 1:45 PM
Location:Moscone West – 2007

Building and maintaining open source communities

Being successful in creating an open source community requires planning, measurements, and clear goals. However, it’s an investment that can pay off tenfold when people come together around a shared vision. Who are we targeting, how can we achieve these goals, and why does it matter to the bigger business strategy?

In this panel you’ll hear from Amye Scarvada (Gluster), Jason Hibbets (, Greg DeKoenigsberg (Ansible), and Leslie Hawthorn (Open source and standards, Red Hat) as they share first-hand experiences about how open source communities have directly contributed to the success of a product, as well as best practices to build and maintain these communities. It will be moderated by Mary Thengvall (Persea Consulting), who after many years of building community programs is now working with companies that are building out a developer relations strategy.

Date:Thursday, May 10
Time:2:00 PM – 2:45 PM
Location:Moscone West – 2007

by Mary Thengvall at May 04, 2018 09:02 PM

Consuming Kolla Tempest container image for running Tempest tests

The Kolla project provides a Docker container image for Tempest.

The provided container image is available in two formats for CentOS: centos-binary-tempest and centos-source-tempest.

The RDO community rebuilds the container image in centos-binary format and pushes it to and to

The Tempest container image contains openstack-tempest and all available Tempest plugins in it.

The benefit of running Tempest tests from the Tempest container is that we do not need to install any Tempest package or Tempest plugin on the deployed cloud, which keeps the environment safe from dependency mismatches and updates.

In TripleO CI, we run Tempest tests using Tempest container images in the tripleo-ci-centos-7-undercloud-containers job using featureset027.

We can consume the same image for running Tempest tests locally in TripleO deployment:

  • Follow this link for installing containerized undercloud.

    Note: At step 5 in the above link, open undercloud.conf in an editor and set

    enable_tempest = true.

    It will pull the tempest container image on the undercloud.

  • If the tempest container image is not available on the undercloud, pull the image from Docker Hub.

    $ sudo docker pull

  • Create two directories, container_tempest and tempest_workspace, and copy stackrc, overcloudrc, tempest-deployer-input.conf, and the whitelist and blacklist files to container_tempest. These files need to be copied from the undercloud into the container. The commands below do this:

    $ mkdir container_tempest tempest_workspace

    $ cp stackrc overcloudrc tempest-deployer-input.conf whitelist.txt blacklist.txt container_tempest

  • Create an alias for running tempest within a container, with container_tempest:/home/stack and tempest:/home/stack mounted

    $ alias docker-tempest="sudo docker run -i \
    -v container_tempest:/home/stack \
    -v tempest:/home/stack \ \

  • Create tempest workspace using docker-tempest alias

    $ docker-tempest tempest init /home/stack/tempest

  • List tempest plugins installed within tempest container

    $ docker-tempest tempest list-plugins

  • Generate tempest.conf using discover-tempest-config

    Note: If tempest tests are running against undercloud then:

    $ source stackrc
    $ docker-tempest discover-tempest-config --create \
    --out /home/stack/tempest/etc/tempest.conf

    Note: If tempest tests are running against overcloud then:

    $ source overcloudrc
    $ docker-tempest discover-tempest-config --create \
    --out /home/stack/tempest/etc/tempest.conf \
    --deployer-input /home/stack/tempest-deployer-input.conf

  • Running tempest tests

    $ docker-tempest tempest run --workspace tempest \
    -c /home/stack/tempest/etc/tempest.conf \
    -r <tempest test regex> --subunit

    In the above command:

    • --workspace : the tempest workspace to use
    • -c : the tempest.conf file to use
    • -r : a regex selecting which tempest tests to run
    • --subunit : generate the test results as a subunit stream in v2 format
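The way the -r regex narrows down the test list can be previewed with a small sketch. The test IDs below are hypothetical and only there for illustration, and grep -E merely approximates the Python regular expression matching that tempest applies, so treat this as a rough preview rather than the exact selection logic.

```shell
#!/usr/bin/env bash
# Hypothetical test IDs, for illustration only.
sample_tests='tempest.api.identity.v3.test_tokens
tempest.api.compute.servers.test_create_server
tempest.api.network.test_networks
tempest.scenario.test_server_basic_ops'

# Print the sample IDs that a given regex would select.
# grep -E only approximates the Python regex matching used by tempest.
select_tests() {
    printf '%s\n' "$sample_tests" | grep -E -- "$1"
}

# Selects the identity and compute API tests from the sample list.
select_tests 'tempest\.api\.(identity|compute)'
```

Escaping the dots keeps the pattern from matching unintended characters, which matters more as the regex grows.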

Once tests are finished, we can find the test output in the /home/stack/tempest folder.

Thanks to the Kolla team, Emilien, Wes, Arx, Martin, Luigi, Andrea, Ghanshyam, Alex, Sagi, Gabriel and the RDO team for helping me get things in place.

Happy Hacking!

by chkumar246 at May 04, 2018 09:47 AM

Running Tempest tests against a TripleO Undercloud

Tempest is the integration test suite used to validate any deployed OpenStack cloud.

The TripleO undercloud is the all-in-one OpenStack installation that includes components for provisioning and managing the OpenStack nodes that form your OpenStack environment (the overcloud).

To validate the undercloud using Tempest, follow the steps below:

  • Using tripleo-quickstart:

    • Follow this link to provision a libvirt guest environment through tripleo-quickstart
    • Deploy the undercloud and run Tempest tests on undercloud against undercloud

      $ bash -R master --no-clone --tags all \
      --nodes config/nodes/1ctlr_1comp.yml \
      -I --teardown none -p quickstart-extras-undercloud.yml \
      --extra-vars test_ping=False \
      --extra-vars tempest_undercloud=True \
      --extra-vars tempest_overcloud=False \
      --extra-vars run_tempest=True \
      --extra-vars test_white_regex='tempest.api.identity|tempest.api.compute' \

      The above command will:

      • Deploy an undercloud
      • Generate script in /home/stack folder
      • Run test_white_regex tempest tests
      • Store all the results in /home/stack/tempest folder.
  • Running Tempest tests manually on undercloud:

    • Deploy the undercloud manually by following this link and then ssh into undercloud.
    • Install openstack-tempest rpm on undercloud

      $ sudo yum -y install openstack-tempest

    • Source stackrc on undercloud

      $ source stackrc

    • Append Identity API version in $OS_AUTH_URL

      The $OS_AUTH_URL defined in stackrc does not contain the Identity API version, which leads to a failure while generating tempest.conf using python-tempestconf. To fix this, we need to append the API version to the OS_AUTH_URL environment variable and export it.


    • Create the Tempest workspace

      $ tempest init <tempest_workspace>

    • Generate Tempest configuration using python-tempestconf

      $ cd <path to the tempest_workspace>

      $ discover-tempest-config --create --out etc/tempest.conf

      The above command will generate tempest.conf in the etc/ directory of the Tempest workspace.

    • Now we are all set to run Tempest tests. Run the following command to run Tempest tests

      $ tempest run -r '(tempest.api.identity|tempest.api.compute)' --subunit

      The above command will:

      • Run tempest.api.identity and tempest.api.compute tests
      • Store the subunit results, in v2 format, in the .stestr directory under the Tempest workspace.
    • Use subunit2html command to generate results in html format

      $ sudo yum -y install python-subunit

      $ subunit2html <path to tempest workspace>/.stestr/0 tempest.html
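Two of the manual steps above can be sketched as small shell helpers: appending the Identity API version to OS_AUTH_URL, and locating the most recent subunit stream to pass to subunit2html. This is a minimal sketch under stated assumptions: the function names are made up for illustration, "v3" as the default Identity version is an assumption about your deployment, the sample URL is a placeholder, and the lookup assumes stestr's file-based repository in which each run is stored as a file named after its numeric id.

```shell
#!/usr/bin/env bash

# Append an Identity API version to an auth URL unless one is already there.
# The default "v3" is an assumption; use the version your deployment serves.
append_identity_version() {
    local url="${1%/}"            # drop a trailing slash, if any
    local version="${2:-v3}"
    case "$url" in
        */v2.0|*/v3) printf '%s\n' "$url" ;;            # already versioned
        *)           printf '%s\n' "$url/$version" ;;
    esac
}

# Print the path of the most recent run in <workspace>/.stestr.
# Assumes runs are stored as files named by numeric id (0, 1, 2, ...).
latest_stestr_run() {
    local repo="$1/.stestr"
    local id
    id="$(ls "$repo" 2>/dev/null | grep -E '^[0-9]+$' | sort -n | tail -n 1)"
    [ -n "$id" ] && printf '%s/%s\n' "$repo" "$id"
}

# Typical use after `source stackrc`; a placeholder URL is used when
# OS_AUTH_URL is not set in the current shell.
OS_AUTH_URL="$(append_identity_version "${OS_AUTH_URL:-http://192.168.24.1:5000}")"
export OS_AUTH_URL
echo "$OS_AUTH_URL"
```

With these helpers in place, the report step could be written as subunit2html "$(latest_stestr_run <path to tempest workspace>)" tempest.html, so that repeated runs do not require editing the run number by hand.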

And we are done with running Tempest on the undercloud.

Currently, the tripleo-ci-centos-7-undercloud-oooq job runs Tempest tests on the undercloud in TripleO CI using featureset003.

Thanks to Emilien, Enrique, Wes, Arx, Martin, Luigi, Alex, Sagi, Gabriel and the RDO team for helping me get things in place.

Happy Hacking!

by chkumar246 at May 04, 2018 08:22 AM

May 03, 2018

Lars Kellogg-Stedman

Using a TM1637 LED module with CircuitPython

CircuitPython is "an education friendly open source derivative of MicroPython". MicroPython is a port of Python to microcontroller environments; it can run on boards with very few resources such as the ESP8266. I've recently started experimenting with CircuitPython on a Wemos D1 mini, which is a small form-factor ESP8266 board …

by Lars Kellogg-Stedman at May 03, 2018 04:00 AM


ARA Records Ansible 0.15 has been released

I recently wrote that ARA was open to limited development for the stable release in order to improve performance for larger-scale users. This 0.15.0 release is the result of that limited development. The #OpenStack community runs over 300,000 continuous integration jobs with #Ansible every month with the help of the awesome Zuul. Learn more about scaling ARA reports with @dmsimard (OpenStack, @OpenStack, April 18, 2018). Changelog for ARA Records Ansible 0.

May 03, 2018 12:00 AM

May 02, 2018

RDO Blog


Bust out your spoons cause we’re about to test the first batch of Rocky [road] ice cream!

Or, y’know, the first Rocky OpenStack milestone.

On 03 and 04 May, TOMORROW and Friday, we’ll have our first milestone test days for Rocky OpenStack. We would love to get as wide participation in the RDO Test Days from our global team as possible!

We’re looking for developers, users, operators, quality engineers, writers, and, yes, YOU. If you’re reading this, we want your help!

Let’s set new records for the number of participants! The number of bugs! The amount of feedback and questions and NOTES!

Oh, my.

But, seriously.

I know that everyone has Important Stuff To Do, but taking a few hours or a day to give things a run-through at various points in the RDO cycle will benefit everyone. Not only will this help identify issues early in the development process, but you can be one of the first to cut your teeth on the latest versions with your favorite deployment methods and environments like TripleO, PackStack, and Kolla.

So, please consider taking a break from your normal duties and spending at least a few hours with us in #rdo on Freenode.

And who knows – if we have enough interest, perhaps we’ll have ACTUAL rocky road ice cream at the next RDO Test Days.


by Rain Leander at May 02, 2018 11:34 AM

Groningen Rain

Rocky Road Ice Cream, People. The Best Ice Cream, Obviously.

Grab your spoons, people, the first milestone of OpenStack Rocky has come and gone which can mean only one thing!

RDO Test Days!

Wait, were you expecting ACTUAL ice cream?

The ice cream is a lie.

But RDO Test Days are HERE!

RDO is a community of people using and deploying OpenStack on CentOS, Fedora, and Red Hat Enterprise Linux. At each OpenStack development cycle milestone, the RDO community holds test days to invite people to install, deploy and configure a cloud using RDO and report feedback. This helps us find issues in packaging, documentation, installation and more but also, where appropriate, to collaborate with the upstream OpenStack projects to file and resolve bugs found throughout the event.

In order to participate, though, people needed to:

* Have hardware available to install and deploy on
* Be reasonably knowledgeable / familiar with OpenStack
* Have the time to go through an end-to-end installation, test it and provide feedback


In an attempt to eliminate these barriers, we’re continuing the experiment started last year by providing a ready to use cloud environment. This cloud will be deployed with the help of Kolla: and Kolla-Ansible: which will install a containerized OpenStack cloud with the help of Ansible and Docker.

The cloud will be built using 5 bare metal servers – three controllers and two compute nodes.

Would you like to participate? We’d love your help!

The next test days start TOMORROW – on the third and fourth of May we will test the first milestone of the latest OpenStack Rocky release.

To sign up to use the Kolla cloud environment and for more information, please visit

In the meantime, visit and join us on channel #rdo on Freenode IRC, where we’re available to answer any questions and troubleshoot.

And if we get enough interest, who knows?

Maybe we’ll get ACTUAL ice cream at future test days.


Me, too!

by rainsdance at May 02, 2018 08:00 AM

April 27, 2018

Red Hat Stack

Highlights from the OpenStack Rocky Project Teams Gathering (PTG) in Dublin

Last month in Dublin, OpenStack engineers gathered from dozens of countries and companies to discuss the next release of OpenStack. This is always my favorite OpenStack event, because I get to do interviews with the various teams, to talk about what they did in the just-released version (Queens, in this case) and what they have planned for the next one (Rocky).

If you want to see all of those interviews, they are on YouTube at:

( and I’m still in the process of editing and uploading them. So subscribe, and you’ll get notified as the new interviews go live.

In this article, I want to mention a few themes that cross projects, so you can get a feel for what’s coming in six months.

The interview chair. Photo: Author

I’ll start with my interview with Thierry Carrez. While it was the last interview I did, watching it first gives a great overview of the event, what was accomplished, why we do the event, and what it will look like next time. (Spoiler: We will have another PTG around early September, but are still trying to figure out what happens after that.)

One theme that was even stronger this time than at past PTGs was cross-project collaboration. This is, of course, a long-time theme in OpenStack, because every project MUST work smoothly with others, or nothing works. But this has been extended to the project level with the introduction of new SIGs – Special Interest Groups. These are teams that focus on cross-project concepts such as scientific computing, API best practices, and security. You can read more about SIGs at

I spoke with two SIGs at the PTG, and I’ll share here the interview with Michael McCune from the OpenStack API SIG.

Another common theme in interviews this year was that while projects did make huge progress on the features front, there was also a lot of work in stabilizing and hardening OpenStack – making it more enterprise-ready, you might say. One of these efforts was the “fast forward upgrade” effort, which is about making it easy to upgrade several versions – say, from Juno all the way to Queens, for example – in one step.

Croke Park, Dublin. PTG Venue! Photo: Author

Part of what makes that possible is the amazing work of the Zuul team, who develop and run the massive testing infrastructure that subjects every change to OpenStack code to a rigorous set of functional and interdependence tests.

And I’ll share one final video with you before sending you off to watch the full list (Again, that’s at Over the years, OpenStack had a reputation of being hard to deploy and manage. This has driven development of the TripleO project, which is an attempt to make deployment easy, and management possible without knowing everything about everything. I did a number of TripleO interviews, because it’s a very large team, working on a diverse problem set.

The big storm during the PTG. Photo: Author

The video that I’ll choose here is with the OpenStack Validations team. This is the subproject that, when you’re deploying OpenStack, checks everything that could go wrong before it has a chance to, so that you don’t waste your time.

There are many other videos that I haven’t featured here, and I encourage you to look at the list and pick the few interviews that are of most interest to you. I tried to keep them short, so that you can get the message without spending your entire day watching. But if you have any questions about any of them, take them to the OpenStack mailing list (Details at where these people will be more than happy to give you more detail.

About Rich Bowen

Rich is a community manager at Red Hat, where he works with the OpenStack and CentOS communities. Find him on Twitter at: @rbowen.


by Rich Bowen at April 27, 2018 02:05 AM

April 25, 2018

John Likes OpenStack

April 20, 2018

RDO Blog

Community Blogpost Round-up: April 20

The last month has been busy to say the least, which is why we haven’t gotten around to posting a recent Blogpost Roundup, but it looks like you all have been busy as well! Thanks as always for continuing to share your knowledge around RDO and OpenStack. Enjoy!

Lessons from OpenStack Telemetry: Deflation by Julien Danjou

This post is the second and final episode of Lessons from OpenStack Telemetry. If you have missed the first post, you can read it here.


Unit tests on RDO package builds by jpena

Unit tests are used to verify that individual units of source code work according to a defined spec. While this may sound complicated to understand, in short it means that we try to verify that each part of our source code works as expected, without having to run the full program they belong to.


Red Hatters To Present at More Than 50 OpenStack Summit Vancouver Sessions by Peter Pawelski, Product Marketing Manager, Red Hat OpenStack Platform

OpenStack Summit returns to Vancouver, Canada May 21-24, 2018, and Red Hat will be returning as well with as big of a presence as ever. Red Hat will be a headline sponsor of the event, and you’ll have plenty of ways to interact with us during the show.


Lessons from OpenStack Telemetry: Incubation by Julien Danjou

It was mostly around that time in 2012 that I and a couple of fellow open-source enthusiasts started working on Ceilometer, the first piece of software from the OpenStack Telemetry project. Six years have passed since then. I’ve been thinking about this blog post for several months (even years, maybe), but lacked the time and the hindsight needed to lay out my thoughts properly. In a series of posts, I would like to share my observations about the Ceilometer development history.


Comparing Keystone and Istio RBAC by Adam Young

To continue with my previous investigation into Istio, and to continue the comparison with the comparable parts of OpenStack, I want to dig deeper into how Istio performs RBAC. Specifically, I would love to answer the question: could Istio be used to perform the Role check?


Scaling ARA to a million Ansible playbooks a month by David Moreau Simard

The OpenStack community runs over 300 000 CI jobs with Ansible every month with the help of the awesome Zuul.


Comparing Istio and Keystone Middleware by Adam Young

One way to learn a new technology is to compare it to what you already know. I’ve heard a lot about Istio, and I don’t really grok it yet, so this post is my attempt to get the ideas solid in my own head, and to spur conversations out there.


Heading to Red Hat Summit? Here’s how you can learn more about OpenStack. by Peter Pawelski, Product Marketing Manager, Red Hat OpenStack Platform

Red Hat Summit is just around the corner, and we’re excited to share all the ways in which you can connect with OpenStack® and learn more about this powerful cloud infrastructure technology. If you’re lucky enough to be headed to the event in San Francisco, May 8-10, we’re looking forward to seeing you. If you can’t go, fear not, there will be ways to see some of what’s going on there remotely. And if you’re undecided, what are you waiting for? Register today. 


Multiple 1-Wire Buses on the Raspberry Pi by Lars Kellogg-Stedman

The DS18B20 is a popular temperature sensor that uses the 1-Wire protocol for communication. Recent versions of the Linux kernel include a kernel driver for this protocol, making it relatively convenient to connect one or more of these devices to a Raspberry Pi or similar device.


An Introduction to Fast Forward Upgrades in Red Hat OpenStack Platform by Maria Bracho, Principal Product Manager OpenStack

OpenStack momentum continues to grow as an important component of hybrid cloud, particularly among enterprise and telco. At Red Hat, we continue to seek ways to make it easier to consume. We offer extensive, industry-leading training, an easy to use installation and lifecycle management tool, and the advantage of being able to support the deployment from the app layer to the OS layer.


Ceph integration topics at OpenStack PTG by Giulio Fidente

I wanted to share a short summary of the discussions that happened around the Ceph integration (in TripleO) at the OpenStack PTG.


Generating a list of URL patterns for OpenStack services. by Adam Young

Last year at the Boston OpenStack summit, I presented on an idea of using URL patterns to enforce RBAC. While this idea is on hold for the time being, a related approach is moving forward, building on top of application credentials. In this approach, the set of acceptable URLs is added to the role, so it is an additional check. This is a lower barrier to entry approach.


by Mary Thengvall at April 20, 2018 02:57 PM

April 19, 2018

Julien Danjou

Lessons from OpenStack Telemetry: Deflation


This post is the second and final episode of Lessons from OpenStack Telemetry. If you have missed the first post, you can read it here.


At some point, the rules on adding new projects relaxed with the Big Tent initiative, allowing us to rename ourselves the OpenStack Telemetry team and to split Ceilometer into several subprojects: Aodh (alarm evaluation functionality) and Panko (events storage). Gnocchi was able to join the OpenStack Telemetry party for its first anniversary.

Finally being able to split Ceilometer into several independent pieces of software allowed us to tackle technical debt more rapidly. We built autonomous teams for each project and gave them the same liberty they had in Ceilometer. The cost of migrating the code base to several projects was higher than we wanted it to be, but we managed to build a clear migration path nonetheless.

Gnocchi Shamble

With Gnocchi in town, we stopped all efforts on Ceilometer storage and API and expected people to adopt Gnocchi. What we underestimated was the unwillingness of many operators to think about telemetry. They did not want to deploy anything to have telemetry features in the first place, so adding yet another component (a timeseries database) to have proper metric features was seen as a burden – and sometimes not seen at all.
Indeed, we also did not communicate enough on our vision for that transition. After two years of existence, many operators were asking what Gnocchi was and what they needed it for. They deployed Ceilometer and its bogus storage and API and were confused about needing yet another piece of software.

It took us more than two years to deprecate the Ceilometer storage and API, which is way too long.


In the meantime, people were leaving the OpenStack boat. Soon enough, we started to feel the shortage of human resources. Smartly, we never followed the OpenStack trend of imposing blueprints, specs, bug reports or any process on contributors, following my list of open source best practices. This flexibility allowed us to iterate more rapidly: compared to other OpenStack projects, we were going faster relative to the size of our contributor base.


Nonetheless, we felt like we were bailing out a sinking ship. Our contributors were disappearing while we were swamped with technical debt: half-baked features, unfinished migrations, legacy choices and temporary hacks. After the big party that happened, we had to wash the dishes and sweep the floor.

Being part of OpenStack started to feel like a burden in many ways. The inertia of OpenStack being a big project was beginning to surface, so we put a lot of effort into dodging most of its implications. Consequently, the team was perceived as an outlier, which does not help, especially when you have to interact a lot with your neighbors.

The OpenStack Foundation never understood the organization of our team. They would refer to us as "Ceilometer" even though we had formally renamed ourselves to "Telemetry", since we encompassed four server projects and a few libraries. For example, while Gnocchi was an OpenStack project for two years before leaving, it was never listed on the project navigator maintained by the foundation.

That's a funny anecdote that demonstrates the peculiarity of our team, and how it has been both a strength and a weakness.


Nobody was trying to do what we were doing when we started Ceilometer. We filled the space of metering OpenStack. However, as the number of companies involved increased, and the friction along with it, some people grew unhappy. The race to have a seat at the table of the feast and become a Project Team Leader was strong, so some people preferred to create their own project rather than trying to play the contribution game. In many areas, including ours, that divided the effort up to a ridiculous point where several teams were doing the exact same thing, or were trying to step on each other's toes to kill the competitors.

We spent a significant amount of time trying to bring other teams into the Telemetry scope, to unify our efforts, without much success. Some companies were not embracing open source because of their cultural differences, while others had no interest in joining a project where they would not be seen as the leader.

That fragmentation did not help us, but also did not do much harm in the end. Most of those projects are now either dead or becoming irrelevant as the rest of the world caught up on what they were trying to do.


As of 2018, I'm the PTL for Telemetry – because nobody else ran. The official list of maintainers for the telemetry projects is five people: two are inactive, and three are part-time. During the latest development cycle (Queens), 48 people committed in Ceilometer, though only three developers made impactful contributions. The code size has been divided by two since the peak: Ceilometer is now 25k lines of code long.

Panko and Aodh have no active developers. A Red Hat colleague and I are keeping the projects afloat so that they keep working.

Gnocchi has humbly thrived since it left OpenStack. The stains from having been part of OpenStack are not yet all gone. It has a small community, but users see its real value and enjoy using it.

Those last six years have been intense, and riding the OpenStack train has been amazing. As I concluded in the first blog post of this series, most of us had a great time overall; the point of those writings is not to complain, but to reflect.

I find it fascinating to see how the evolution of a piece of software and the metamorphosis of its community are entangled. The amount of politics that a corporately-backed project of this size generates is majestic and has a prominent influence on the outcome of software development.

So, what's next? Well, as far as Ceilometer is concerned, we still have ideas and plans to keep shrinking its footprint to a minimum. We hope that one day Ceilometer will become irrelevant – at least that's what we're trying to achieve so we don't have anything to maintain. That mainly depends on how the myriad of OpenStack projects will choose to address their metering.

We don't see any future for Panko or Aodh.

Gnocchi, now blooming outside of OpenStack, is still young and promising. We've plenty of ideas and every new release brings new fancy features. The storage of timeseries at large scale is exciting. Users are happy, and the ecosystem is growing.

We'll see how all of that concludes, but I'm sure it'll be new lessons to learn and write about in six years!

by Julien Danjou at April 19, 2018 11:55 AM

April 17, 2018

RDO Blog

Unit tests on RDO package builds

Unit tests are used to verify that individual units of source code work according to a defined spec. While this may sound complicated to understand, in short it means that we try to verify that each part of our source code works as expected, without having to run the full program they belong to.

All OpenStack projects come with their own set of unit tests, for example this is the unit test folder for the oslo.config project. Those tests are executed when a new patch is proposed for review, to ensure that existing (or new) functionality is not broken with the new code. For example, if you check this review, you can see that one of the CI jobs executed is “openstack-tox-py27”, which runs unit tests using Python 2.7.

Unit tests in action

How does this translate into the packaging world? As part of a spec file, we can define a %check section, where we add scripts to test the installed code. While this is not a mandatory section in the Fedora packaging guidelines, it is highly recommended, since it provides a good assurance that the code packaged is correct.

In many cases, RDO packages include this %check section in their specs, and the project’s unit tests are executed when the package is built. This is an example of the unit tests executed for the python-oslo-utils package.

“But why are these tests executed again when packaging?”, you may ask. After all, these same tests are executed by the Zuul gate before being merged. Well, there are quite a few reasons for this:

  • Those unit tests were run with a specific operating system version and a specific package set. Those are probably different from the ones used by RDO, so we need to ensure the project is compatible with those components.
  • The project dependencies are installed in the OpenStack gate using pip, and some versions may differ. This is because OpenStack projects support a range of versions for each dependency, but usually only test with one version. We have seen cases where a project stated support for version x.0 of a library, but then added code that required version x.1. This change would not be noticed by the OpenStack gate, but it would make unit tests fail while packaging.
  • They also allow us to detect issues before they happen in the upstream gate. OpenStack projects use the requirements project to decide which version of their own libraries should be used by other projects. This allows for some inter-dependency issues, where a change in an Oslo library may uncover a bug in another project, but it is not noticed until the requirements project is updated with a new version of the Oslo library. In the RDO case, we run an RDO Trunk builder using code from the master branch in all projects, which allows us to notify in advance, like in this example bug.
  • They give us an early warning when new dependencies have been added to a project, but they are not in the package spec yet. Since unit tests exercise most of the code, any missing dependency should make them fail.
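The version-range problem described in the second bullet can be illustrated with a short sketch. The version numbers and variable names are hypothetical, and the comparison relies on sort -V (version sort), which is available in GNU coreutils; this is an illustration of the failure mode, not how the RDO tooling actually checks dependencies.

```shell
#!/usr/bin/env bash

# Return success when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

declared_minimum="2.0"   # hypothetical: what the project says it supports
actually_needed="2.1"    # hypothetical: what the new code really requires
packaged="2.0"           # hypothetical: what the distribution ships

# The packaged version satisfies the declared range but not the real need,
# which is the situation that only shows up when tests run at build time.
if version_ge "$packaged" "$declared_minimum" && ! version_ge "$packaged" "$actually_needed"; then
    echo "declared range is satisfied, but unit tests would fail during the package build"
fi
```

The upstream gate would typically install a newer version through pip and pass, while the package build, using the distribution's version, exposes the mismatch.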

Due to the way unit tests are executed during a package build, there are some details to keep in mind when defining them. If you as a developer follow them, you will make packagers’ lives easier:

  • Do not create unit tests that depend on resources available from the Internet. Most packaging environments do not allow Internet access while the package is being built, so a unit test that depends on resolving an IP address via DNS will fail.

  • Try to keep unit test runtime within reasonable limits. If unit tests for a project take 1 hour to complete, it is likely they will not be executed during packaging, such as here.

  • Do not assume that unit tests will always be executed on a machine with 8 fast cores. We have seen cases of unit tests failing when run on a limited environment or when it takes them more than a certain time to finish.

Now that you know the importance of unit tests for RDO packaging, you can go ahead and make sure we use them on every package. Happy hacking!

by jpena at April 17, 2018 04:49 PM

April 13, 2018

Red Hat Stack

Red Hatters To Present at More Than 50 OpenStack Summit Vancouver Sessions

OpenStack Summit returns to Vancouver, Canada May 21-24, 2018, and Red Hat will be returning as well with as big of a presence as ever. Red Hat will be a headline sponsor of the event, and you’ll have plenty of ways to interact with us during the show.

First, you can hear from our head of engineering and OpenStack Foundation board member, Mark McLoughlin, during the Monday morning Keynote sessions. Mark will be discussing OpenStack’s role in a hybrid cloud world, as well as the importance of OpenStack and Kubernetes integrations. After the keynotes, you’ll want to come by the Red Hat booth in the exhibit hall to score some cool SWAG (it goes quickly), talk with our experts, and check out our product demos. Finally, you’ll have the entire rest of the show to listen to Red Hatters present and co-present on a variety of topics, from specific OpenStack projects, to partner solutions, to OpenStack integrations with Kubernetes, Ansible, Ceph storage and more. These will be delivered via traditional sessions, labs, workshops, and lunch and learns. For a full list of general sessions featuring Red Hatters, see below.

Beyond meeting us at the Red Hat booth or listening to one of us speak in a session or during a keynote, here are the special events we’ll be sponsoring where you can also meet us. If you haven’t registered yet, use our sponsor code: REDHAT10 to get 10% off the list price.

Containers, Kubernetes and OpenShift on OpenStack Hands-on Training
Join Red Hat’s OpenShift team for a full day of discussion and hands-on labs to learn how OpenShift can help you deliver apps even faster on OpenStack.
Date: May 20th, 9:00 am-4:00 pm
Location: Vancouver Convention Centre West – Level Two – Room 218-219
RSVP required

Red Hat and Trilio Evening Social
All are invited to join Red Hat and Trilio for an evening of great food, drinks, and waterfront views of Vancouver Harbour.
When: Monday, May 21st, 7:30-10:30 pm
Location: TapShack Coal Harbour
RSVP required 

Red Hat and Dell: Crafting Your Cloud Reality
Join Red Hat and Dell EMC for drinks and food, and take part in the Red Hat® Cloud Challenge, an immersive virtual reality game.
When: Tuesday, May 22nd, 6:00-9:00 pm
Location: Steamworks Brew Pub
RSVP required

Women of OpenStack Networking Lunch sponsored by Red Hat
Meet with other women for lunch and discuss important topics affecting women in technology and business
Guest speaker: Margaret Dawson, Vice President of Product Marketing, Red Hat
Date: Wednesday, May 23 2018, 12:30-1:50 pm
Location: Vancouver Convention Centre West, Level 2, Room 215-216 
More information


Red Hat Training and Certification Lunch and Learn
Topic: Performance Optimization in Red Hat OpenStack Platform
Wednesday, May 23rd, 12:30-1:30 pm
Location: Vancouver Convention Centre West, Level 2, Room 213-214 

RSVP required 

Red Hat Jobs Social

Connect with Red Hatters and discover why working for the open source leader is a future worth exploring. We’ll have food, drinks, good vibes, and a chance to win some awesome swag.
Date: Wednesday, May 23, 6:00-8:00 pm
Location: Rogue Kitchen and Wetbar
RSVP required

Red Hat Sponsored Track – Monday, May 21, Room 202-204

We’ve got a great lineup of speakers on a variety of topics speaking during our sponsored breakout track on Monday, May 21. The speakers and topics are:

Session Speaker Time
Open HPE Telco NFV-Infrastructure platforms with Red Hat OpenStack Mushtaq Ahmed (HPE) 11:35 AM
What’s New in Security for Red Hat OpenStack Platform? Keith Basil 1:30 PM
Is Public Cloud Really Eating OpenStack’s Lunch? Margaret Dawson 2:20 PM
OpenShift on OpenStack and Bare Metal Ramon Acedo Rodriguez 3:10 PM
The Modern Telco is Open Ian Hood 4:20 PM
Cloud Native Applications in a Telco World – How Micro Do You Go? Ron Parker (Affirmed Networks), Azhar Sayeed 5:10 PM


Breakout Sessions Featuring Red Hatters


Session Speaker Time
OpenStackSDKs – Project Update Monty Taylor 1:30 PM
Docs/i18n – Project Onboarding Stephen Finucane, Frank Kloeker (Deutsche Telekom), Ian Y. Choi (Fusetools Korea) 1:30 PM
Linux Containers Internal Lab Scott McCarty 1:30 PM
The Wonders of NUMA, or Why Your High Performance Application Doesn’t Perform Stephen Finucane 2:10 PM
Glance – Project Update Erno Kuvaja 3:10 PM
Call It Real: Virtual GPUs in Nova Sylvain Bauza, Jianhua Wang (Citrix) 3:10 PM
A Unified Approach to Role-Based Access Control Adam Young 3:10 PM
Unlock Big Data Efficiency with CephData Lake Kyle Bader, Yong Fu (Intel), Jian Zhang (Intel), Yuan Zhuo (INTC) 4:20 PM
Storage for Data Platforms Kyle Bader, Uday Boppana 5:20 PM



Session Speaker Time
OpenStack with IPv6: Now You Can! Dustin Schoenbrun, Tiago Pasqualini (NetApp), Erlon Cruz (NetApp) 9:00 AM
Integrating Keystone with large-scale centralized authentication Ken Holden, Chris Janiszewski 9:50 AM
Sahara – Project Onboarding Telles Nobrega 11:00 AM
Lower the Barriers: Or How To Make Hassle-Free Open Source Events Sven Michels 11:40 AM
Barbican – Project Update Ade Lee, Dave McCowan (Cisco) 11:50 AM
Glance – Project Onboarding Erno Kuvaja, Brian Rosmaita (Verizon) 11:50 AM
Kuryr – Project Update Daniel Mellado 12:15 PM
Sahara – Project Update Telles Nobrega 1:50 PM
Heat – Project Update Rabi Mishra, Thomas Herve, Rico Lin (EasyStack) 1:50 PM
Superfluidity: One Network To Rule Them All Daniel Mellado, Luis Tomas Bolivar, Irena Berezovsky (Huawei) 3:10 PM
Burnin’ Down the Cloud: Practical Private Cloud Management David Medberry, Steven Travis (Time Warner Cable) 3:30 PM
Infra – Project Onboarding David Moreau-Simard, Clark Boylan (OpenStack Foundation) 3:30 PM
Intro to Kata Containers Components: a Hands-on Lab Sachin Rathee, Sudhir Kethamakka 4:40 PM
Kubernetes Network-policies and Neutron Security Groups – Two Sides of the Same Coin? Daniel Mellado, Eyal Leshem (Huawei) 5:20 PM
How To Survive an OpenStack Cloud Meltdown with Ceph Federico Lucifredi, Sean Cohen, Sebastien Han 5:30 PM
OpenStack Internal Messaging at the Edge: In Depth Evaluation Kenneth Giusti, Matthieu Simonin, Javier Rojas Balderrama 5:30 PM


Session Speaker Time
Barbican – Project Onboarding Ade Lee, Dave McCowan (Cisco) 9:00 AM
Oslo – Project Update Ben Nemec 9:50 AM
Kuryr – Project Onboarding Daniel Mellado, Irena Berezovsky (Huawei) 9:50 AM
How To Work with Adjacent Open Source Communities – User, Developer, Vendor, Board Perspective Mark McLoughlin, Anni Lai (Huawei), Davanum Srinivas (Mirantis), Christopher Price (Ericsson), Gnanavelkandan Kathirvel (AT&T) 11:50 AM
Nova – Project Update Melanie Witt 11:50 AM
TripleO – Project Onboarding Alex Schultz, Emilien Macchi, Dan Prince 11:50 AM
Distributed File Storage in Multi-Tenant Clouds using CephFS Tom Barron, Ramana Raja, Patrick Donnelly 12:20 PM
Lunch & Learn – Performance optimization in Red Hat OpenStack Platform Razique Mahroa 12:30 PM
Cinder Thin Provisioning: a Comprehensive Guide Gorka Eguileor, Tiago Pasqualini (NetApp), Erlon Cruz (NetApp) 1:50 PM
Nova – Project Onboarding Melanie Witt 1:50 PM
Glance’s Power of Image Import Plugins Erno Kuvaja 2:30 PM
Mistral – Project Update Dougal Matthews 3:55 PM
Mistral – Project Onboarding Dougal Matthews, Brad Crochet 4:40 PM
Friendly Coexistence of Virtual Machines and Containers on Kubernetes using KubeVirt Stu Gott, Stephen Gordon 5:30 PM
Intro to Container Security Thomas Cameron 11:50 AM


Session Speaker Time
Manila – Project Update Tom Barron 9:00 AM
Oslo – Project Onboarding Ben Nemec, Kenneth Giusti, Jay Bryant (Lenovo) 9:00 AM
Walk Through of an Automated OpenStack Deployment Using Triple-O Coupled with OpenContrail – POC Kumythini Ratnasingham, Brent Roskos, Michael Henkel (Juniper Networks) 9:00 AM
Working Remotely in a Worldwide Community Doug Hellmann, Julia Kreger, Flavio Percoco, Kendall Nelson (OpenStack Foundation), Matthew Oliver (SUSE) 9:50 AM
Manila – Project Onboarding Tom Barron 9:50 AM
Centralized Policy Engine To Enable Multiple OpenStack Deployments for Telco/NFV Bertrand Rault, Marc Bailly (Orange), Ruan He (Orange) 11:00 AM
Kubernetes and OpenStack Unified Networking Using Calico – Hands-on Lab Amol Chobe 11:00 AM
Multi Backend CNI for Building Hybrid Workload Clusters with Kuryr and Kubernetes Daniel Mellado, Irena Berezovsky (Huawei) 11:50 AM
Workshop/Lab: Containerize your Life! Joachim von Thadden 1:50 PM
Root Your OpenStack on a Solid Foundation of Leaf-Spine Architecture! Joe Antkowiak, Ken Holden 2:10 PM
Istio: How To Make Multicloud Applications Real Christian Posta, Chris Hoge (OpenStack Foundation), Steve Drake (Cisco), Lin Sun, Costin Manolache (Google) 2:40 PM
Push Infrastructure to the Edge with Hyperconverged Cloudlets Kevin Jones 3:30 PM
A DevOps State of Mind: Continuous Security with Kubernetes Chris Van Tuin 3:30 PM
OpenStack Upgrades Strategy: the Fast Forward Upgrade Maria Angelica Bracho, Lee Yarwood 4:40 PM
Managing OpenStack with Ansible, a Hands-on Workshop Julio Villarreal Pelegrino, Roger Lopez 4:40 PM


We’re looking forward to seeing you there!


by Peter Pawelski, Product Marketing Manager, Red Hat OpenStack Platform at April 13, 2018 07:12 PM

April 12, 2018

Julien Danjou

Lessons from OpenStack Telemetry: Incubation

It was mostly around that time in 2012 that I and a couple of fellow open-source enthusiasts started working on Ceilometer, the first piece of software from the OpenStack Telemetry project. Six years have passed since then. I've been thinking about this blog post for several months (even years, maybe), but lacked the time and the hindsight needed to lay out my thoughts properly. In a series of posts, I would like to share my observations about the Ceilometer development history.

To understand the full picture here, I think it is fair to start with a small retrospective on the project. I'll try to keep it short, and it will be unmistakably biased, even if I'll do my best to stay objective – bear with me.


In early 2012, I remember discussing with the first Ceilometer developers the right strategy to solve the problem we were trying to address. The company I worked for wanted to run a public cloud, and billing for resource usage was at the heart of the strategy. The fact that no component in OpenStack exposed any consumption API was a problem.

We debated how to implement those metering features in the cloud platform. There were two natural solutions: either implementing some resource accounting report in each OpenStack project, or building new software on the side to cover for the lack of that functionality.

At that time there were fewer than a dozen OpenStack projects. Still, patching every project seemed like an infinite task. Having code reviewed and merged in the most significant projects took several weeks, which, considering our timeline, was a show-stopper. We wanted to go fast.

Pragmatism won, and we started implementing Ceilometer using the features each OpenStack project was offering to help us: very little.

Our first and obvious candidate for usage retrieval was Nova, where Ceilometer aimed to retrieve statistics about virtual machine instance utilization. Nova offered no API to retrieve that data – and still doesn't. Since waiting several months to have such an API exposed was out of the question, we took the shortcut of polling libvirt, Xen or VMware directly from Ceilometer.

That's precisely how temporary hacks become historical design. Implementing this design broke the basis of the abstraction layer that Nova aims to offer.

As time passed, several leads were followed to mitigate those trade-offs in better ways. But with each development cycle, getting anything merged in OpenStack became harder and harder. It went from patches taking a long time to review to a long list of requirements for merging anything. Soon, you'd have to create a blueprint to track your work and write a full specification linked to that blueprint, with that specification itself being reviewed by a bunch of the so-called core developers. The specification had to be a thorough document covering every aspect of the work, from the problem being solved to the technical details of the implementation. Once the specification was approved, which could take an entire cycle (6 months), you'd have to make sure that the Nova team would make your blueprint a priority. To make sure it was, you would have to fly a few thousand kilometers from home to an OpenStack Summit, and argue in person with developers in a room filled with hundreds of other folks about the urgency of your feature compared to other blueprints.

An OpenStack design session in Hong-Kong, 2013

Even if you passed all of those ordeals, the code you'd send could be rejected, and you'd get back to updating your specification to shed light on some particular points that confused people. Back to square one.

Nobody wanted to play that game. Not in the Telemetry team at least.

So Ceilometer continued to grow, surfing the OpenStack hype curve. More developers were joining the project every cycle – each one with their own list of ideas, features or requirements cooked up by their in-house product manager.

But many features did not belong in Ceilometer. They should have been in different projects. Ceilometer was the first OpenStack project to pass through the OpenStack Technical Committee incubation process that existed before the rules were relaxed.

This incubation process was uncertain, long, and painful. We had to justify the existence of the project and many of the technical choices that had been made. Where we were expecting the committee to challenge us on fundamental decisions, such as breaking abstraction layers, it was mostly nit-picking about Web frameworks or database storage.


The rigidity of the process discouraged anyone from starting a new project for anything related to telemetry. Therefore, everyone went ahead and started dumping their ideas into Ceilometer itself. With more than ten companies interested, friction was high, and the project was at some point pulled apart in all directions. This phenomenon was happening to every OpenStack project anyway.

On the one hand, many contributions brought marvelous pieces of technology to Ceilometer. We implemented several features you still don't find in any metering system. Dynamically sharded, automatically horizontally scalable polling? Ceilometer has had that for years, whereas you can't have it in, e.g., Prometheus.

On the other hand, there were tons of crappy features. Half-baked code merged because somebody needed to ship something. As the project grew further, some of us developers started to feel that this was getting out of control and could be disastrous. The technical debt was growing as fast as the project was.

Several technical choices made were definitely bad. The architecture was a mess; the messaging bus was easily overloaded, the storage engine was non-performant, etc. People would come to me (as I was the Project Team Leader at that time) and ask why the REST API would need 20 minutes to reply to an autoscaling request. The willingness to solve everything for everyone was killing Ceilometer. It's around that time that I decided to step down from my role as PTL and started working on Gnocchi to, at least, solve one of our biggest challenges: efficient data storage.

Ceilometer was also suffering from the poor quality of many OpenStack projects. As Ceilometer retrieves data from a dozen other projects, it has to use their interfaces for data retrieval (API calls, notifications) – or sometimes compensate for their lack of any interface. Users were complaining about Ceilometer malfunctioning while the root of the problem was actually on the other side, in the polled project. The polling agent would try to retrieve the list of virtual machines running on Nova, but just listing and retrieving this information required several HTTP requests to Nova. And those basic retrieval requests would overload the Nova API. The API does not offer any genuine interface from which the data could be retrieved in a small number of calls. And it had terrible performance.
From the point of view of the users, the load was generated by Ceilometer. Therefore, Ceilometer was the problem. We had to imagine new ways of circumventing tons of limitations from our siblings. That was exhausting.

At its peak, during the Juno and Kilo releases (early 2015), the code size of Ceilometer reached 54k lines of code, and the number of committers reached 100 individuals (20 regulars). We had close to zero happy users, operators hated us, and everybody was wondering what the hell was going on in those developers' minds.

Nonetheless, despite the impediments, most of us had a great time working on Ceilometer. Nothing's ever perfect. I've learned tons of things during that period, which were actually mostly non-technical. Community management, social interactions, human behavior and politics were at the heart of the adventure, offering a great opportunity for self-improvement.

In the next blog post, I will cover what happened in the years that followed that booming period, up until today. Stay tuned!

by Julien Danjou at April 12, 2018 12:50 PM

April 09, 2018

Adam Young

Comparing Keystone and Istio RBAC

To continue with my previous investigation into Istio, and to continue the comparison with the comparable parts of OpenStack, I want to dig deeper into how Istio performs RBAC. Specifically, I would love to answer the question: could Istio be used to perform the Role check?


Let me reiterate what I’ve said in the past about scope checking. Oslo-policy performs the scope check deep in the code base, long after Middleware, once the resource has been fetched from the Database. Since we can’t do this in Middleware, I think it is safe to say that we can’t do this in Istio either. So that part of the check is outside the scope of this discussion.

Istio RBAC Introduction

Let’s look at how Istio performs RBAC.

The first thing to compare is the data that is used to represent the requester. In Istio, this is the requestcontext. This is comparable to the Auth-Data that Keystone Middleware populates as a result of a successful token validation. How does Istio populate the requestcontext? My current assumption is that it makes a remote call to Mixer with the authenticated REMOTE_USER name.

What is telling is that, in Istio, you have

      user: source.user | ""
      groups: ""
         service: source.service | ""
         namespace: source.namespace | ""

Groups, but no roles. Kubernetes has RBAC, and Roles, but it is a late addition to the model. However…

Istio RBAC introduces ServiceRole and ServiceRoleBinding, both of which are defined as Kubernetes CustomResourceDefinition (CRD) objects.

ServiceRole defines a role for access to services in the mesh.
ServiceRoleBinding grants a role to subjects (e.g., a user, a group, a service)

This is interesting. Whereas Keystone requires a user to go to Keystone to get a token that is then associated with a set of role assignments, Istio expands this assignment inside the service.

Keystone Aside: Query Auth Data without Tokens

This is actually not surprising. When looking into Keystone Middleware years ago, in the context of PKI tokens, I realized that we could do exactly the same thing; make a call to Keystone based on the identity, and look up all of the data associated with the token. This means that a user can go from a SAML provider right to the service without first getting a Keystone token.

What this means is that the Mixer can return the Roles assigned by Kubernetes as additional parameters in the “Properties” collection. However, with the ServiceRole, you would instead get the Service Role Binding list from Mixer and apply it in process.

We discussed Service Roles on multiple occasions in Keystone. I liked the idea, but wanted to make sure that we didn’t limit the assignments, or even the definitions, to just a service. I could see specific Endpoints varying in their roles even within the same service, and certainly having different Service Role Assignments. I’m not certain if Istio distinguishes between “services” and “different endpoints of the same service” yet…something I need to delve into. However, assuming that it does distinguish, what Istio needs to be able to request is “Give me the set of Role bindings for this specific endpoint.”

A history lesson in Endpoint Ids.

It was this last step that was a problem in Keystonemiddleware. An endpoint did not know its own ID, and the provisioning tools really did not like the workflow of

  1. create an endpoint for a service
  2. register endpoint with Keystone
  3. get back the endpoint ID
  4. add endpoint  ID to the config file
  5. restart the service

Even if we went with a URL-based scheme, we would have had this problem.  An obvious (in hindsight) solution would be to pre-generate the IDs as a unique hash, and to pre-populate the configuration files as well as to post the IDs to Keystone.  These IDs could easily be tagged as a nickname, not even the canonical name of the service.
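A minimal sketch of that pre-generation idea follows. It assumes a deployment-wide namespace UUID and an ID derived from service name, interface and region; none of these choices come from Keystone itself, and whether Keystone would accept a caller-supplied endpoint ID is a separate question that this sketch does not answer.

```python
import uuid

# Assumption: a fixed namespace UUID chosen once per deployment.
ENDPOINT_NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def endpoint_id(service, interface, region):
    """Derive a stable endpoint ID from values known before deployment."""
    name = "{}:{}:{}".format(service, interface, region)
    # uuid5 is a name-based hash, so the same inputs always give the same ID.
    return uuid.uuid5(ENDPOINT_NAMESPACE, name).hex


# The provisioning tool could write this value into the service's config
# file and post it to Keystone, with no restart round trip in between.
print(endpoint_id("compute", "public", "RegionOne"))
```

Because the ID is a pure function of its inputs, the config file and the Keystone registration can be generated independently and still agree.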

Istio Initialization

Istio does not have this problem, directly, as it knows the name of the service that it is protecting, and can use that to fetch the correct rules.  However, it does point to a chicken-egg problem that Istio has to solve; which is created first, the service itself, or the abstraction in Istio to cover it?  Since Kubernetes is going to orchestrate the Service deployment, it can make the sensible call;  Istio can cover the service and just reject calls until it is properly configured.

URL Matching Rules

If we look at the Policy enforcement in Nova, we can use the latest “Policy in Code” mechanisms to link from the URL pattern to the Policy rule key, and the key to the actual enforced policy.  For example, to delete a server we can look up the API

And see that it is


And from the Nova source code:

        SERVERS % 'delete',
        "Delete a server",
                'method': 'DELETE',
                'path': '/servers/{server_id}'

With SERVERS %  expanding via :  SERVERS = 'os_compute_api:servers:%s'  to  os_compute_api:servers:delete.
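The expansion is plain Python string formatting, which a short sketch makes concrete. The `RULE_KEYS` table is illustrative only and is not how Nova stores its rules; just the `SERVERS` template and the delete entry come from the fragment above.

```python
SERVERS = 'os_compute_api:servers:%s'

# Illustrative lookup table: (HTTP method, URL pattern) -> policy rule key.
RULE_KEYS = {
    ('DELETE', '/servers/{server_id}'): SERVERS % 'delete',
}

print(RULE_KEYS[('DELETE', '/servers/{server_id}')])
# prints: os_compute_api:servers:delete
```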

Digging into OpenStack Policy

Then, assuming you can get your hands on the policy file specific to that Nova server, you could look at the policy for that rule. Nova no longer includes that generated file in the etc directory. But in my local repo I have:
"os_compute_api:servers:delete": "rule:admin_or_owner"

And the rule:admin_or_owner expanding to "admin_or_owner": "is_admin:True or project_id:%(project_id)s" which does not do a role check at all. The policy.yaml or policy.json file is not guaranteed to exist, in which case you can either use the tool to generate it, or read the source code. From the above link we see the Rule is:


and then we need to look where that is defined.
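To see why the admin_or_owner expansion quoted above involves no role check, it can help to write it out by hand. This is a plain-Python reading of the rule string, not the oslo.policy engine; the credential and target dictionaries are hypothetical.

```python
def admin_or_owner(creds, target):
    """Hand-written reading of "is_admin:True or project_id:%(project_id)s"."""
    is_admin = creds.get("is_admin") is True
    # The project on the token must equal the project that owns the resource.
    is_owner = ("project_id" in creds
                and creds["project_id"] == target.get("project_id"))
    return is_admin or is_owner


server = {"project_id": "p1"}
owner = {"is_admin": False, "project_id": "p1", "roles": []}
outsider = {"is_admin": False, "project_id": "p2", "roles": ["custodian"]}

print(admin_or_owner(owner, server))     # True
print(admin_or_owner(outsider, server))  # False
```

The `roles` list is never consulted: holding or lacking a role changes nothing until the rule itself is overridden.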

Let’s assume, for the moment, that a Nova deployment has overridden the main rule to implement a custom role called Custodian which has the ability to execute this API. Could Istio match that? It really depends on whether it can match the URL-Pattern: '/servers/{server_id}'.

In ServiceRole, the combination of “namespace”+”services”+”paths”+”methods” defines “how a service (services) is allowed to be accessed”.

So we can match down to the Path level. However, there seems to be no way to tokenize a Path. Thus, while you could set a rule that says a client can call DELETE on a specific instance, or DELETE on /services, or even DELETE on all URLS in the catalog (whether they support that API or not) you could not say that it could call delete on all services within a specific Namespace. If the URL were defined like this:

DELETE /services?service_id={someuuid}

Istio would be able to match the service ID in the set of keys.

In order for Istio to be able to match effectively, all it would really need is to identify that a URL that ends in /services/feed1234 matches the pattern /services/{service_id}, which is all that the URL pattern matching inside the Web servers does.
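That kind of tokenized matching is small enough to sketch. The following is an illustration of the general technique using only the Python standard library; it is not taken from Istio or from any particular Web framework, and `compile_pattern` is a hypothetical name.

```python
import re


def compile_pattern(pattern):
    """Turn '/services/{service_id}' into a regex with one group per token."""
    # Splitting on a capturing group alternates literal text and token names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)                   # literal path text
        else:
            regex += "(?P<{}>[^/]+)".format(part)      # one path segment
    return re.compile("^" + regex + "$")


matcher = compile_pattern("/services/{service_id}")
print(matcher.match("/services/feed1234").groupdict())
# prints: {'service_id': 'feed1234'}
print(matcher.match("/services/feed1234/extra"))
# prints: None
```

Each token matches exactly one path segment, so deeper URLs do not accidentally match a shallower pattern.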

Istio matching

It looks like paths can have wildcards. Scroll down a bit to the quote:

In addition, we support prefix match and suffix match for all the fields in a rule. For example, you can define a “tester” role that has the following permissions in “default” namespace:

which has the further example:

    - services: ["bookstore.default.svc.cluster.local"]
       paths: ["*/reviews"]
       methods: ["GET"]

Deep URL matching

So, while this is a good start, there are many more complicated URLs in the OpenStack world which are tokenized in the middle: for example, the new API for System role assignments has both the Role ID and the User ID embedded. The Istio match would be limited to matching: PUT /v3/system/users/* which might be OK in this case. But there are cases where a PUT at one level means one role, much more powerful than a PUT deeper in the URL chain.

For example: The base role assignments API itself is much more complex. Assigning a role on a domain uses a URL fragment comparable to the one used to edit the domain-specific configuration file. Both would have to be matched with

       paths: ["/v3/domains/*"]
       methods: ["PUT"]

But assigning a role is a far safer operation than setting a domain specific config, which is really an administrative only operation.
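The conflict can be shown with a small sketch. The matching semantics below (exact match, trailing `*` for prefix, leading `*` for suffix) are my reading of the quoted Istio documentation, not Istio code, and the domain, user and role IDs are made up.

```python
def path_matches(rule_path, request_path):
    """Assumed semantics: exact match, 'prefix*' or '*suffix' only."""
    if rule_path.endswith("*"):
        return request_path.startswith(rule_path[:-1])
    if rule_path.startswith("*"):
        return request_path.endswith(rule_path[1:])
    return rule_path == request_path


def allowed(rule, method, path):
    """True if the method is listed and any rule path matches the request."""
    return (method in rule["methods"]
            and any(path_matches(p, path) for p in rule["paths"]))


rule = {"paths": ["/v3/domains/*"], "methods": ["PUT"]}

assign_role = "/v3/domains/d1/users/u1/roles/r1"
domain_config = "/v3/domains/d1/config"

print(allowed(rule, "PUT", assign_role))    # True
print(allowed(rule, "PUT", domain_config))  # True
```

Both requests pass the same rule, so a prefix match alone cannot grant one operation without also granting the other.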

However, I had to dig deeply to find this conflict. I suspect that there are ways around it, and comparable conflicts in the catalog.


So, the tentative answer to my question is:

Yes, Istio could perform the Role check part of RBAC for OpenStack.

But it would take some work. Of course. An early step would be to write a Mixer plugin to fetch the auth-data from Keystone based on a user. This would require knowing about Federated mappings and how to expand them, plus querying the Role assignments. Oh, and getting the list of Groups for a user. And the project ID needs to be communicated, somehow.

by Adam Young at April 09, 2018 05:55 PM


Scaling ARA to a million Ansible playbooks a month

The OpenStack community runs over 300 000 CI jobs with Ansible every month with the help of the awesome Zuul. It even provides ARA reports for ARA’s integration test jobs in a sort-of nested way. Zuul’s Ansible ends up installing Ansible and ARA. It makes my brain hurt sometimes… but in an awesome way. As a core contributor of the infrastructure team there, I get to witness issues and get a lot of feedback directly from the users.

April 09, 2018 12:00 AM

April 07, 2018

Adam Young

Comparing Istio and Keystone Middleware

One way to learn a new technology is to compare it to what you already know. I’ve heard a lot about Istio, and I don’t really grok it yet, so this post is my attempt to get the ideas solid in my own head, and to spur conversations out there.

I asked the great Google “Why is Istio important” and this was the most interesting response it gave me: “What is Istio and Its Importance to Container Security.” So I am going to start there. There are obviously many articles about Istio, and this might not even be the best starting point, but this is the internet: I’m sure I’ll be told why something else is better!

Let’s start with the definition:

Istio is an intelligent and robust web proxy for traffic within your Kubernetes cluster as well as incoming traffic to your cluster

At first blush, this seems to be nothing like Keystone. However, let’s take a look at the software definition of Proxy:

A proxy, in its most general form, is a class functioning as an interface to something else.

In the OpenStack code base, the package python-keystonemiddleware provides a Python class that complies with the WSGI contract and serves as a Proxy to the web application underneath. Keystone Middleware, then, is an analogue to the Istio Proxy in that it performs some of the same functions.
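To make the “WSGI class as Proxy” idea concrete, here is a minimal sketch. It is not the real keystonemiddleware class: `TokenCheckMiddleware` and its in-memory token table are hypothetical stand-ins for token validation against Keystone. Only the WSGI calling convention and the X-Auth-Token header name are real.

```python
def application(environ, start_response):
    """Stand-in for the protected web application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


class TokenCheckMiddleware:
    """Illustrative WSGI proxy that sits in front of another WSGI app."""

    def __init__(self, app, valid_tokens):
        self.app = app
        self.valid_tokens = valid_tokens  # assumption: token -> user name

    def __call__(self, environ, start_response):
        # WSGI exposes the X-Auth-Token header as HTTP_X_AUTH_TOKEN.
        token = environ.get("HTTP_X_AUTH_TOKEN")
        user = self.valid_tokens.get(token)
        if user is None:
            start_response("401 Unauthorized",
                           [("Content-Type", "text/plain")])
            return [b"authentication required"]
        # Hand identity data to the wrapped application, as a proxy would.
        environ["REMOTE_USER"] = user
        return self.app(environ, start_response)


protected = TokenCheckMiddleware(application, {"secret": "alice"})
```

The wrapped application never sees unauthenticated requests, which is the same position the Istio proxy occupies in front of a Kubernetes service.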

Istio enables you to specify access control rules for web traffic between Kubernetes services

So…Keystone + Oslo-policy serves this role in OpenStack. The Kubernetes central control is a single web server, and thus it can implement Access control for all subcomponents in a single process space. OpenStack is distributed, and thus the access control is also distributed. However, due to the way that OpenStack objects are stored, we cannot do the full RBAC enforcement in middleware (much as I would like to). In order to check access to an existing resource object in OpenStack, you have to perform the policy enforcement check after the object has been fetched from the Database. That check needs to ensure that the project of the token matches the project of the resource. Since we don’t know this information based solely on the URL, we cannot perform it in Middleware.

What we can perform in Middleware, and what I presented on last year at the OpenStack Summit, is the ability to perform the Role check portion of RBAC in middleware, but defer the project check until later. While we are not going to be doing exactly that, we are pursuing a related effort for application credentials. However, that requires a remote call to a database to create those rules. Istio is not going to have that leeway. I think? Please correct me if I am wrong.
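The split between the two checks can be sketched as follows. All names here are hypothetical: the required-role table, the in-memory `SERVERS` dictionary standing in for the database, and the token layout are illustrations, not Keystone or oslo.policy structures.

```python
# Hypothetical mapping from (method, URL pattern) to the role that may call it.
REQUIRED_ROLES = {("DELETE", "/servers/{server_id}"): "member"}

# Hypothetical stand-in for the database of existing resources.
SERVERS = {"s1": {"project_id": "p1"}}


def middleware_role_check(token, method, url_pattern):
    """Stage 1: needs only the request and the roles on the token."""
    return REQUIRED_ROLES[(method, url_pattern)] in token["roles"]


def scope_check(token, server_id):
    """Stage 2: needs the resource, so it runs after the database fetch."""
    return SERVERS[server_id]["project_id"] == token["project_id"]


token = {"roles": ["member"], "project_id": "p2"}
print(middleware_role_check(token, "DELETE", "/servers/{server_id}"))  # True
print(scope_check(token, "s1"))                                        # False
```

The request passes the role check in middleware but is still rejected later, because only the fetched resource reveals that it belongs to a different project.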

I don’t think Istio could perform this level of deep check, either. It requires parsing the URL and knowing the semantics of the segments, and having the ability to correlate them. That is a lot to ask.

Istio enables you to seamlessly enforce encryption and authentication between nodes

Keystone certainly does not do this. Nothing enforces TLS between services in OpenStack. Getting TLS everywhere in TripleO was a huge effort, and it still needs to be explicitly enabled. OpenStack does not provide a CA. TripleO, when deployed, depends on the Dogtag instance from the FreeIPA server to manage certificates.

By the time Keystone Middleware is executed, the TLS layer would be a distant memory.

Keystoneauth1 is the client piece from Keystone, and it could be responsible for making sure that only HTTPS is supported, but it does not do that today.

Istio collects traffic logs, and then parses and presents them for you:

Keystone does not do this, although it does produce some essential log entries about access.

At this point, I am wondering if Istio would be a viable complement to the security story in OpenStack. My understanding thus far is that it would. It might conflict a minor bit with the RBAC enforcement, but I suspect that is not the key piece of what it is doing, and conflict there could be avoided.

Please post your comments, as I would really like to get to know this better, and we can share the discussion with the larger community.

by Adam Young at April 07, 2018 10:00 PM

March 29, 2018

Red Hat Stack

Heading to Red Hat Summit? Here’s how you can learn more about OpenStack.

Red Hat Summit is just around the corner, and we’re excited to share all the ways in which you can connect with OpenStack® and learn more about this powerful cloud infrastructure technology. If you’re lucky enough to be headed to the event in San Francisco, May 8-10, we’re looking forward to seeing you. If you can’t go, fear not, there will be ways to see some of what’s going on there remotely. And if you’re undecided, what are you waiting for? Register today

From the time Red Hat Summit begins you can find hands-on labs, general sessions, panel discussions, demos in our partner pavilion (Hybrid Cloud section), and more throughout the week. You’ll also hear from Red Hat OpenStack Platform customers on their successes during some of the keynote presentations. Need an open, massively scalable storage solution for your cloud infrastructure? We’ll also have sessions dedicated to our Red Hat Ceph Storage product.

Red Hat Summit has grown significantly over the years, and this year we’ll be holding activities in both Moscone South and Moscone West. And with all of the OpenStack sessions and labs happening, it may seem daunting to make it to everything, especially if you need to transition from one building to the next. But worry not. Our good friends from Red Hat Virtualization will be sponsoring pedicabs to help transport you between the buildings.

Here’s our list of sessions for OpenStack and Ceph at Red Hat Summit:


Session | Speaker | Time / Location
Lab – Deploy a containerized HCI IaaS with OpenStack and Ceph | Rhys Oxenham, Greg Charot, Sebastien Han, John Fulton | 10:00 am / Moscone South, room 156
Ironic, VM operability combined with bare-metal performances | Cedric Morandin (Amadeus) | 10:30 am / Moscone West, room 2006
Lab – Hands-on with OpenStack and OpenDaylight SDN | Rhys Oxenham, Nir Yechiel, Andre Fredette, Tim Rozat | 1:00 pm / Moscone South, room 158
Panel – OpenStack use cases: how business succeeds with OpenStack featuring Cisco, IAG, Turkcell, and Duke Health | August Simonelli, Pete Pawelski | 3:30 pm / Moscone West, room 2007
Lab – Understanding containerized Red Hat OpenStack Platform | Ian Pilcher, Greg Charot | 4:00 pm / Moscone South, room 153
Red Hat OpenStack Platform: the road ahead | Nick Barcet, Mark McLoughlin | 4:30 pm / Moscone West, room 2007


Session | Speaker | Time / Location
Lab – First time hands-on with Red Hat OpenStack Platform | Rhys Oxenham, Jacob Liberman | 10:00 am / Moscone South, room 158
Red Hat Ceph Storage roadmap: past, present, and future | Neil Levine | 10:30 am / Moscone West, room 2024
Optimize Ceph object storage for production in multisite clouds | Michael Hackett, John Wilkins | 11:45 am / Moscone South, room 208
Production-ready NFV at Telecom Italia (TIM) | Fabrizio Pezzella, Matteo Bernacchi, Antonio Gianfreda (Telecom Italia) | 11:45 am / Moscone West, room 2002
Workload portability using Red Hat CloudForms and Ansible | Bill Helgeson, Jason Woods, Marco Berube | 11:45 am / Moscone West, room 2009
Delivering Red Hat OpenShift at ease on Red Hat OpenStack Platform and Red Hat Virtualization | Francesco Vollero, Natale Vinto | 3:30 pm / Moscone South, room 206
The future of storage and how it is shaping our roadmap | Sage Weil | 3:30 pm / Moscone West, room 2020
Lab – Hands on with Red Hat OpenStack Platform | Rhys Oxenham, Jacob Liberman | 4:00 pm / Moscone South, room 153
OpenStack and OpenShift networking integration | Russell Bryant, Antoni Segura Puimedon, and Jose Maria Ruesta (BBVA) | 4:30 pm / Moscone West, room 2011



Session | Speaker | Time / Location
Workshop – OpenStack roadmap in action | Rhys Oxenham | 10:45 am / Moscone South, room 214
Medical image processing with OpenShift and OpenStack | Daniel McPherson, Ata Turk (Boston University), Rudolph Pienaar (Boston Children’s Hospital) | 11:15 am / Moscone West, room 2006
Scalable application platform on Ceph, OpenStack, and Ansible | Keith Hagberg (Fidelity), Senthivelrajan Lakshmanan (Fidelity), Michael Pagan, Sacha Dubois, Alexander Brovman (Solera Holdings) | 1:00 pm / Moscone West, room 2007
Red Hat CloudForms: turbocharge your OpenStack | Kevin Jones, Jason Ritenour | 2:00 pm / Moscone West, room 201
What’s new in security for Red Hat OpenStack Platform? | Nathan Kinder, Keith Basil | 2:00 pm / Moscone West, room 2003
Ceph Object Storage for Apache Spark data platforms | Kyle Bader, Mengmeng Liu | 2:00 pm / Moscone South, room 207
OpenStack on FlexPod – like peanut butter and jelly | Guil Barros, Amit Borulkar (NetApp) | 3:00 pm / Moscone West, room 2009


Hope to see you there!


by Peter Pawelski, Product Marketing Manager, Red Hat OpenStack Platform at March 29, 2018 02:00 PM

March 27, 2018

Lars Kellogg-Stedman

Multiple 1-Wire Buses on the Raspberry Pi

The DS18B20 is a popular temperature sensor that uses the 1-Wire protocol for communication. Recent versions of the Linux kernel include a kernel driver for this protocol, making it relatively convenient to connect one or more of these devices to a Raspberry Pi or similar device. 1-Wire devices can be …

by Lars Kellogg-Stedman at March 27, 2018 04:00 AM

March 22, 2018

Red Hat Stack

An Introduction to Fast Forward Upgrades in Red Hat OpenStack Platform

OpenStack momentum continues to grow as an important component of hybrid cloud, particularly among enterprise and telco. At Red Hat, we continue to seek ways to make it easier to consume. We offer extensive, industry-leading training, an easy to use installation and lifecycle management tool, and the advantage of being able to support the deployment from the app layer to the OS layer.

One area that some of our customers ask about is the rapid release cycle of OpenStack. And while this speed can be advantageous in getting key features to market faster, it can also be quite challenging to follow for customers looking for stability.

With the release of Red Hat OpenStack Platform 10 in December 2016, we introduced a solution to this challenge – we call it the Long Life release. This type of release includes support for a single OpenStack release for a minimum of three years plus an option to extend another two full years. We offer this via an ELS (Extended Life Support) allowing our customers to remain on a supported, production-grade OpenStack code base for far longer than the usual six-month upstream release cycle. Then, when it’s time to upgrade, they can upgrade in-place and without additional hardware to the next Long Life release. We aim to designate a Long Life release every third release, starting with Red Hat OpenStack Platform 10 (Newton).

Now, with the upcoming release of Red Hat OpenStack Platform 13 (Queens), we are introducing our second Long Life release. This means we can, finally and with great excitement, introduce the world to our latest new feature: the fast forward upgrade.


Fast forward upgrades take customers easily between Long Life releases. It is a first for our OpenStack distribution, using Red Hat OpenStack Platform director (known upstream as TripleO) and aims to change the game for OpenStack customers. With this feature, you can choose to stay on the “fast train” and get all the features from upstream every six months, or remain on a designated release for longer, greatly easing the upgrade treadmill some customers are feeling.

It’s a pretty clever procedure, as it consolidates three consecutive upgrades into a single process and therefore greatly reduces the number of steps needed to perform them. In particular, it reduces the number of reboots needed, which in the case of very large deployments makes a huge difference. This capability, combined with the extended support for security vulnerabilities and key backports from future releases, has made Red Hat OpenStack Platform 10 very popular with our customers.


Red Hat OpenStack Life Cycle dates

Under the covers of a Fast Forward Upgrade

The fast forward upgrade starts like any other upgrade, with a Red Hat OpenStack Platform 10 minor update. A minor update may contain everything from security patches to functional enhancements to even backports from newer releases. There is nothing new about the update procedure for Red Hat OpenStack Platform 10; what changes is the set of packages included, such as kernel changes and other OpenStack-specific changes. We’re placing all changes requiring a reboot in this minor update. The update procedure allows for a sequential update of the undercloud, control plane nodes and the compute nodes. It may also include an instance evacuation procedure so that there is no impact to running workloads even if they reside on a node scheduled for reboot after the update. The resulting Red Hat OpenStack Platform 10 cloud will have the necessary operating system components to operate in Red Hat OpenStack Platform 13 without further node reboots.

The next step is the sequential upgrade of the undercloud from Red Hat OpenStack Platform 10, to 11, to 12, to 13. There is no stopping during these steps and in case of a needed rollback of this portion you must return to Red Hat OpenStack Platform 10.
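
As a rough illustration of the sequence described above, the following Python sketch lists the release-to-release hops that a fast forward upgrade passes through. The function names are hypothetical, and the "every third release" rule is only the stated aim from this post, not a product guarantee.

```python
def fast_forward_hops(start, target):
    """Return the ordered release-to-release hops between two versions."""
    if target <= start:
        raise ValueError("target must be newer than start")
    return [(version, version + 1) for version in range(start, target)]


def is_long_life(version, first=10, cadence=3):
    """Assumed rule: every third release, starting with 10, is Long Life."""
    return version >= first and (version - first) % cadence == 0


print(fast_forward_hops(10, 13))  # [(10, 11), (11, 12), (12, 13)]
print(is_long_life(13))           # True
```

The three hops are still performed in order; what the procedure changes is that they run as one continuous operation rather than three separate upgrade projects.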

During the fast forward upgrade there are opportunities to perform backups. Take them: there is no automated rewind, so a rollback means restoring from these backups.

A lot of things have changed between Red Hat OpenStack Platform 10 and 13. The most notable is the introduction of OpenStack services in containers. But don’t worry! The fast forward upgrade procedure takes the cloud from a non-containerized deployment to a resulting cloud with OpenStack services running in containers, while abstracting and reducing the complexity of upgrading through the middle releases of Red Hat OpenStack Platform 11 and 12.

In the final steps of the fast forward upgrade procedure, the overcloud will move from Red Hat OpenStack Platform 10 to 13 using a procedure that syncs databases, generates templates for 13, and installs 13’s services in containers. While some of the content for these steps may be part of a release which is no longer supported, Red Hat will provide full support for the required code to perform the upgrade.

What’s next …

In order for this procedure to be supported, it needs to be validated with the released code, and carefully tested in many situations. For this reason, it is scheduled to be ready for testing from Red Hat OpenStack Platform 13 general availability (GA); however, we will warn users launching the procedure that it should not be used in production environments. We encourage you to try the procedure on test environments during this period, and report any issues you find via the normal support procedure.  This will greatly help us ensure that we are covering all cases. During this time support cases relating to fast forward upgrades will not be eligible for high priority response times. 

Once we have thoroughly field tested the procedure, fixed bugs, and are confident that it is ready, we will remove this warning and make an announcement on this same blog. After this happens, it will be OK to proceed with fast forward upgrades in production environments. You can follow this progress of validation and testing by following this blog and staying in touch with your local Red Hat account and support teams.

Stay tuned for future fast forward upgrade blogs where we will dig deeper into the details of this procedure and share experiences and use cases that we’ve tested and validated.


Additionally, we will give an in-depth presentation on the fast forward upgrade process at this year’s Red Hat Summit May 8-10 in San Francisco and OpenStack Summit in Vancouver May 21-24. Please come and visit us in San Francisco and Vancouver for exciting Red Hat news, demos, and direct access to Red Hatters from all over the world. See you there!

by Maria Bracho, Principal Product Manager OpenStack at March 22, 2018 09:15 AM

March 19, 2018

Giulio Fidente

Ceph integration topics at OpenStack PTG

I wanted to share a short summary of the discussions that happened around the Ceph integration (in TripleO) at the OpenStack PTG.

ceph-{container,ansible} branching

Together with John Fulton and Guillaume Abrioux (and after PTG, Sebastien Han) we put some thought into how to make the Ceph container images and ceph-ansible releases better fit the OpenStack model; the container images and ceph-ansible are in fact loosely coupled (not all versions of the container images work with all versions of ceph-ansible) and we wanted to move from a "rolling release" to a "point release" approach, mainly to permit regular maintenance of the previous versions known to work with the previous OpenStack versions. The plan goes more or less as follows:

  • ceph-{container,ansible} should be released together with the regular ceph updates
  • ceph-container will start using tags and stable branches like ceph-ansible does

The changes for the ceph/daemon docker images are visible already:

Multiple Ceph clusters

In an attempt to better support the "edge computing" use case, we discussed adding support for the deployment of multiple Ceph clusters in the overcloud.

Together with John Fulton and Steven Hardy (and after PTG, Gregory Charot) we realized this could be done using multiple stacks and by doing so, hopefully simplify management of the "cells" and avoid potential issues due to orchestration of large clusters.

Much of this will build on Shardy's blueprint to split the control plane, see spec at:

The multiple Ceph clusters specifics will be tracked via another blueprint:

ceph-ansible testing with TripleO

We had a very good chat with John Fulton, Guillaume Abrioux, Wesley Hayutin and Javier Pena on how to get new pull requests for ceph-ansible tested with TripleO; basically, trigger an existing TripleO scenario on changes proposed to ceph-ansible.

Given ceph-ansible is hosted on GitHub, Wesley and Javier suggested this should be possible with Zuul v3 and volunteered to help; some of the complications are about building an RPM from uncommitted changes for testing.

Move ceph-ansible triggering from workflow_tasks to external_deploy_tasks

This is a requirement for the Rocky release; we want to migrate away from using workflow_tasks and use external_deploy_tasks instead, to integrate into the "config-download" mechanism.

This work is tracked via a blueprint and we have a WIP submission on review:

We're also working with Sofer Athlan-Guyot on the enablement of Ceph in the upgrade CI jobs and with Tom Barron on scenario004 to deploy Manila with Ganesha (and CephFS) instead of the CephFS native backend.

Hopefully I didn't forget much; to stay updated on the progress join #tripleo on freenode or check our integration squad status at:

by Giulio Fidente at March 19, 2018 02:32 AM

March 16, 2018

Adam Young

Generating a list of URL patterns for OpenStack services.

Last year at the Boston OpenStack summit, I presented on an idea of using URL patterns to enforce RBAC. While this idea is on hold for the time being, a related approach is moving forward building on top of application credentials. In this approach, the set of acceptable URLs is added to the role, so it is an additional check. This is a lower barrier to entry approach.

One thing I requested on the specification was to use the same mechanism as I had put forth on the RBAC in Middleware spec: the URL pattern. The set of acceptable URL patterns will be specified by an operator.

The user selects the URL pattern they want to add as a “white-list” to their application credential. A user could further specify a dictionary to fill in the segments of that URL pattern, to get a delegation down to an individual resource.
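
To make the idea concrete, here is a minimal Python sketch of how a pattern could be narrowed with a dictionary and then checked against a request. This is not the mechanism from the specification: `fill_pattern`, `pattern_to_regex` and `is_allowed` are hypothetical helpers, and the assumption that each `{segment}` matches exactly one path component is mine.

```python
import re


def fill_pattern(pattern, values):
    """Fill the named segments that have a value; leave the others as-is."""
    def substitute(match):
        return values.get(match.group(1), match.group(0))
    return re.sub(r"\{(\w+)\}", substitute, pattern)


def pattern_to_regex(pattern):
    """Compile a pattern so each remaining {segment} matches one path part."""
    parts = re.split(r"(\{\w+\})", pattern)
    regex = "".join(
        "[^/]+" if part.startswith("{") else re.escape(part) for part in parts
    )
    return re.compile("^" + regex + "$")


def is_allowed(method, url, whitelist):
    """Check a request against (method, pattern) entries of a white-list."""
    return any(
        method == allowed and pattern_to_regex(pattern).match(url)
        for allowed, pattern in whitelist
    )


pattern = "/v3/users/{user_id}/application_credentials/{application_credential_id}"
narrowed = fill_pattern(pattern, {"user_id": "abc123"})
whitelist = [("GET", narrowed)]
print(is_allowed("GET", "/v3/users/abc123/application_credentials/xyz", whitelist))  # True
print(is_allowed("GET", "/v3/users/other/application_credentials/xyz", whitelist))   # False
```

Filling in `user_id` turns the generic pattern into one that only matches that user's credentials, which is the "delegation down to an individual resource" described above.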

I wanted to see how easy it would be to generate a list of URL patterns. It turns out that, for the projects that are using the oslo-policy-in-code approach, it is pretty easy:

cd /opt/stack/nova
 . .tox/py35/bin/activate
(py35) [ayoung@ayoung541 nova]$ oslopolicy-sample-generator  --namespace nova | egrep "POST|GET|DELETE|PUT" | sed 's!#!!'
 POST  /servers/{server_id}/action (os-resetState)
 POST  /servers/{server_id}/action (injectNetworkInfo)
 POST  /servers/{server_id}/action (resetNetwork)
 POST  /servers/{server_id}/action (changePassword)
 GET  /os-agents
 POST  /os-agents
 PUT  /os-agents/{agent_build_id}
 DELETE  /os-agents/{agent_build_id}

Similarly for Keystone:

$ oslopolicy-sample-generator  --namespace keystone  | egrep "POST|GET|DELETE|PUT" | sed 's!# !!' | head -10
GET  /v3/users/{user_id}/application_credentials/{application_credential_id}
GET  /v3/users/{user_id}/application_credentials
POST  /v3/users/{user_id}/application_credentials
DELETE  /v3/users/{user_id}/application_credentials/{application_credential_id}
PUT  /v3/OS-OAUTH1/authorize/{request_token_id}
GET  /v3/users/{user_id}/OS-OAUTH1/access_tokens/{access_token_id}
GET  /v3/users/{user_id}/OS-OAUTH1/access_tokens/{access_token_id}/roles/{role_id}
GET  /v3/users/{user_id}/OS-OAUTH1/access_tokens
GET  /v3/users/{user_id}/OS-OAUTH1/access_tokens/{access_token_id}/roles
DELETE  /v3/users/{user_id}/OS-OAUTH1/access_tokens/{access_token_id}

The output of the tool is a little sub-optimal, as the oslo policy enforcement used to be done using only JSON, and JSON does not allow comments, so I had to scrape the comments out of the YAML format. Ideally, we could tweak the tool to output the URL patterns and the policy rules that enforce them in a clean format.
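
The scraping above can also be done with a few lines of Python, which gives structured tuples instead of text. This is only a sketch: it assumes the commented line format seen in the excerpts in this post (`# METHOD  /path (action)`), and `extract_url_patterns` is a hypothetical helper, not part of oslo.policy.

```python
import re

# Matches commented lines such as "# POST  /servers/{server_id}/action (restore)".
URL_LINE = re.compile(
    r"^#\s*(GET|POST|PUT|PATCH|DELETE|HEAD)\s+(/\S*)(?:\s+\((.+)\))?\s*$"
)


def extract_url_patterns(sample_text):
    """Return (method, url_pattern, action) tuples from generator output."""
    patterns = []
    for line in sample_text.splitlines():
        match = URL_LINE.match(line.strip())
        if match:
            patterns.append(match.groups())
    return patterns


sample = """\
# Create, list, update, and delete guest agent builds
# GET  /os-agents
# PUT  /os-agents/{agent_build_id}
# POST  /servers/{server_id}/action (evacuate)
#"os_compute_api:os-evacuate": "rule:admin_api"
"""
for entry in extract_url_patterns(sample):
    print(entry)
```

Unlike the `egrep` filter, the regular expression only accepts lines where the HTTP method is followed by a path, so description lines that happen to contain a word like "GET" are skipped.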

What roles are used? Turns out, we can figure that out, too:

$ oslopolicy-sample-generator  --namespace keystone  |  grep \"role:
#"admin_required": "role:admin or is_admin:1"
#"service_role": "role:service"

So only admin or service are actually used. On Nova:

$ oslopolicy-sample-generator  --namespace nova  |  grep \"role:
#"context_is_admin": "role:admin"

Only admin.

How about matching the URL pattern to the policy rule?
If I run

oslopolicy-sample-generator  --namespace nova  |  less

In the middle I can see an example like this (# marks removed for syntax):

# Create, list, update, and delete guest agent builds

# This is XenAPI driver specific.
# It is used to force the upgrade of the XenAPI guest agent on
# instance boot.
 GET  /os-agents
 POST  /os-agents
 PUT  /os-agents/{agent_build_id}
 DELETE  /os-agents/{agent_build_id}
"os_compute_api:os-agents": "rule:admin_api"

This is not 100% deterministic, though, as some services, Nova in particular, enforce policy based on the payload.

For example, these operations can be done by the resource owner:

# Restore a soft deleted server or force delete a server before
# deferred cleanup
 POST  /servers/{server_id}/action (restore)
 POST  /servers/{server_id}/action (forceDelete)
"os_compute_api:os-deferred-delete": "rule:admin_or_owner"

Whereas these operations must be done by an admin operator:

# Evacuate a server from a failed host to a new host
 POST  /servers/{server_id}/action (evacuate)
"os_compute_api:os-evacuate": "rule:admin_api"

Both map to the same URL pattern. We tripped over this when working on RBAC in Middleware, and it is going to be an issue with the Whitelist as well.
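
A short Python sketch can surface this kind of overlap automatically by grouping policy names per method and URL pattern. It assumes, based on the excerpts above, that each rule line follows the URL lines it governs in the generator output; `rules_by_url` is a hypothetical helper.

```python
import re
from collections import defaultdict

URL_LINE = re.compile(r"^#\s*(GET|POST|PUT|PATCH|DELETE|HEAD)\s+(/\S*)")
RULE_LINE = re.compile(r'^#\s*"([^"]+)":\s*"([^"]*)"')


def rules_by_url(sample_text):
    """Map each (method, url_pattern) to the set of policy names guarding it."""
    mapping = defaultdict(set)
    pending = []  # URL lines seen since the last rule line
    for raw in sample_text.splitlines():
        line = raw.strip()
        url = URL_LINE.match(line)
        rule = RULE_LINE.match(line)
        if url:
            pending.append(url.groups())
        elif rule:
            for key in pending:
                mapping[key].add(rule.group(1))
            pending = []
    return mapping


sample = """\
# POST  /servers/{server_id}/action (restore)
# POST  /servers/{server_id}/action (forceDelete)
#"os_compute_api:os-deferred-delete": "rule:admin_or_owner"

# POST  /servers/{server_id}/action (evacuate)
#"os_compute_api:os-evacuate": "rule:admin_api"
"""
ambiguous = {k: v for k, v in rules_by_url(sample).items() if len(v) > 1}
print(ambiguous)
```

Any key left in `ambiguous` is a URL pattern that cannot be mapped to a single policy without also looking at the request body.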

Looking at the API docs, we can see that difference in the bodies of the operations. The Evacuate call has a body like this:

{
    "evacuate": {
        "host": "b419863b7d814906a68fb31703c0dbd6",
        "adminPass": "MySecretPass",
        "onSharedStorage": "False"
    }
}

Whereas the forceDelete call has a body like this:

    {"forceDelete": null}

From these, it is pretty straightforward to figure out what policy to apply, but as of yet, there is no programmatic way to access that.

It would take a little more scripting to try to identify the set of rules that mean a user should be able to perform those actions with a project scoped token versus the set of APIs that are reserved for cloud operations. However, just looking at the admin_or_owner rule for most is sufficient to indicate that it should be performed using a scoped token. Thus, an end user should be able to determine the set of operations that she can include in a white-list.
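
Until such a programmatic way exists, the lookup has to be maintained by hand. The Python sketch below shows the shape of it: the top-level key of the action body selects the policy. The `ACTION_POLICIES` table is a hypothetical, hand-written mapping built only from the three actions quoted in this post.

```python
import json

# Hypothetical lookup table, filled in by hand from the policy excerpts above.
ACTION_POLICIES = {
    "restore": "os_compute_api:os-deferred-delete",
    "forceDelete": "os_compute_api:os-deferred-delete",
    "evacuate": "os_compute_api:os-evacuate",
}


def policy_for_action(body_text):
    """Return the policy name for a server action body, or None if unknown."""
    body = json.loads(body_text)
    if not isinstance(body, dict) or len(body) != 1:
        raise ValueError("expected a JSON object with exactly one action key")
    (action,) = body.keys()
    return ACTION_POLICIES.get(action)


print(policy_for_action('{"forceDelete": null}'))
print(policy_for_action('{"evacuate": {"host": "b419863b7d814906a68fb31703c0dbd6"}}'))
```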

by Adam Young at March 16, 2018 05:35 PM

March 15, 2018

RDO Blog

March 14 Community Blog Round-up

Hardware burn-in in the CERN datacenter by Tim Bell

During the Ironic sessions at the recent OpenStack Dublin PTG in Spring 2018, there were some discussions on adding a further burn in step to the OpenStack Bare Metal project (Ironic) state machine. The notes summarising the sessions were reported to the openstack-dev list. This blog covers the CERN burn in process for the systems delivered to the data centers as one example of how OpenStack Ironic users could benefit from a set of open source tools to burn in newly delivered servers as a stage within the Ironic workflow.


Using Docker macvlan networks by Lars Kellogg-Stedman

A question that crops up regularly on #docker is “How do I attach a container directly to my local network?” One possible answer to that question is the macvlan network type, which lets you create “clones” of a physical interface on your host and use that to attach containers directly to your local network. For the most part it works great, but it does come with some minor caveats and limitations. I would like to explore those here.


A New Fencing Mechanism (TBD) by Andrew Beekhof

Protecting Database Centric Applications. In the same way that some applications require the ability to persist records to disk, for some applications the loss of access to the database means game over – more so than disconnection from the storage.


Generating a Callgraph for Keystone by Adam Young

Once I know a starting point for a call, I want to track the other functions that it calls. pycallgraph will generate an image that shows me that.


Inspecting Keystone Routes by Adam Young

What Policy is enforced when you call a Keystone API? Right now, there is no definitive way to say. However, with some programmatic help, we might be able to figure it out from the source code. Let’s start by getting a complete list of the Keystone routes.


SnowpenStack by rbowen

I’m heading home from SnowpenStack and it was quite a ride. As Thierry said in our interview at the end of Friday (coming soon to a YouTube channel near you), rather than spoiling things, the freak storm and subsequent closure of the event venue served to create a shared experience and camaraderie that made it even better.


Expiry of VMs in the CERN cloud by Jose Castro Leon

The CERN cloud resources are used for a variety of purposes from running compute intensive workloads to long running services. The cloud also provides personal projects for each user who is registered to the service. This allows a small quota (5 VMs, 10 cores) where the user can have resources dedicated for their use such as boxes for testing. A typical case would be for the CERN IT Tools training where personal projects are used as sandboxes for trying out tools such as Puppet.


My 2nd birthday as a Red Hatter by Carlos Camacho

This post is about my experience working in TripleO as a Red Hatter for the last 2 years. On my 2nd birthday as a Red Hatter, I have learned about many technologies, really a lot… But the most intriguing thing is that here you never stop learning. Not just because you want to learn new things, but because of the project’s nature, this project… TripleO…


by Mary Thengvall at March 15, 2018 02:27 AM

March 12, 2018

OpenStack In Production (CERN)

Hardware burn-in in the CERN datacenter

During the Ironic sessions at the recent OpenStack Dublin PTG in Spring 2018, there were some discussions on adding a further burn in step to the OpenStack Bare Metal project (Ironic) state machine. The notes summarising the sessions were reported to the openstack-dev list. This blog covers the CERN burn in process for the systems delivered to the data centers as one example of how OpenStack Ironic users could benefit from a set of open source tools to burn in newly delivered servers as a stage within the Ironic workflow.

CERN hardware procurement follows a formal process compliant with public procurements. Following a market survey to identify potential companies in CERN's member states, a tender specification is sent to the companies asking for offers based on technical requirements.

Server burn in goals

Following the public procurement processes at CERN, large hardware deliveries occur once or twice a year and smaller deliveries multiple times per year. The overall resource management at CERN was covered in a previous blog. Part of the steps before production involves burn in of new servers. The goals are:
  • Ensure that the hardware delivered complies with CERN Technical Specifications
  • Find systematic issues with all machines in a delivery such as bad firmware
  • Identify failed components in single machines
  • Provoke early failure in failing components due to high load during stress testing
Depending on the hardware configuration, the burn-in tests take on average around two weeks but do vary significantly (e.g. for systems with large memory amounts, the memory tests alone can take up to two weeks). This has been found to be a reasonable balance between achieving the goals above compared to delaying the production use of the machines with further testing which may not find more errors.

Successful execution of the CERN burn in processes is required in the tender documents prior to completion of the invoicing.


The CERN hardware follows a lifecycle from procurement to retirement as outlined below. The parts marked in red are the ones currently being implemented as part of the CERN Bare Metal deployment.

As part of the evaluation, test systems are requested from the vendor and these are used to validate compliance with the specifications. The results are also retained to ensure that the bulk equipment deliveries correspond to the initial test system configurations and performance.

Preliminary Checks

CERN requires that the Purchase Order ID and a unique System Serial Number are set in the NVRAM of the Baseboard Management Controller (BMC), in the Field Replaceable Unit (FRU) fields Product Asset Tag (PAT) and Product Serial (PS) respectively:

# ipmitool fru print 0 | tail -2
 Product Serial        : 245410-1
 Product Asset Tag     : CD5792984

The Product Asset Tag is set to the CERN delivery number and the Product Serial is set to the unique serial number for the system unit.
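
A preliminary check like this is easy to script. The following Python sketch parses the `Key : Value` lines shown above and reports any required field that is missing or empty; `parse_fru` and `check_fru` are hypothetical helpers, not CERN's tooling.

```python
def parse_fru(output):
    """Turn 'Key : Value' lines of `ipmitool fru print` into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def check_fru(fields):
    """Return the names of required fields that are missing or empty."""
    required = ("Product Serial", "Product Asset Tag")
    return [name for name in required if not fields.get(name)]


sample = """\
 Product Serial        : 245410-1
 Product Asset Tag     : CD5792984
"""
fields = parse_fru(sample)
print(fields["Product Asset Tag"], check_fru(fields))  # CD5792984 []
```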

Likewise, certain BIOS fields have to be set correctly such as booting from network before disk to ensure the systems can be easily commissioned.

Once these basic checks have been done, the burn in process can start. A configuration file, containing the burn-in tests to be run, is created according to the information stored in the PAT and PS FRU fields. Based on the content of the configuration file, the enabled tests will automatically start.

Burn in

The burn in process itself is highlighted in red in the workflow above, consisting of the following steps:
  • Memory
  • CPU
  • Storage
  • Benchmarking
  • Network


The memtest stress tester is used for validation of the RAM in the system. Details of the tool are available at


Testing the CPU is performed using a set of burn tools: burnK7 or burnP6, and burnMMX. These tools not only test the CPU itself but are also useful to find cooling issues such as broken fans since the power load is significant with the processors running these tests.


Disk burn ins are intended to create the conditions for early drive failure. Following the bathtub curve of failure rates, the aim is to make drives that are prone to early failure fail before they reach production.

With this aim, we use the badblocks code to repeatedly read/write the disks. SMART counters are then checked to see if there are significant numbers of relocated bad blocks and the CERN tenders require disk replacement if the error rate is high.
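
The SMART check can be sketched in Python as follows. This is an illustration only: it assumes the counters are read with `smartctl -A` and laid out in its usual ten-column attribute table, and both the sample values and the `max_reallocated` limit are invented for the example rather than taken from the CERN tenders.

```python
def reallocated_sectors(smartctl_output):
    """Return the raw Reallocated_Sector_Ct value, or None if absent."""
    for line in smartctl_output.splitlines():
        columns = line.split()
        if len(columns) >= 10 and columns[1] == "Reallocated_Sector_Ct":
            return int(columns[9])
    return None


def needs_replacement(smartctl_output, max_reallocated=10):
    """Assumed policy: flag a disk whose reallocated count exceeds a limit."""
    count = reallocated_sectors(smartctl_output)
    return count is not None and count > max_reallocated


# Illustrative attribute table, not real measurement data.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       24
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
"""
print(reallocated_sectors(sample), needs_replacement(sample))  # 24 True
```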

We still use this process although the primary disk storage for the operating system has now changed to SSD. There may be a case for minimising the writing on an SSD to maximise the life cycle of the units.


Many of the CERN hardware procurements are based on price for total compute capacity needed. With the nature of most of the physics processing, the total throughput of the compute farm is more important than the individual processor performance. Thus, it may be that the most total performance can be achieved by choosing processors which are slightly slower but less expensive.

CERN currently measures the CPU performance using a set of benchmarks based on a subset of the SPEC 2006 suite. The subset, called HEPSpec06, is run in parallel on each of the cores in the server to determine the total throughput from the system. Details are available at the HEPiX Benchmarking Working Group web site.

Since the offers include the expected benchmark performance, the results of the benchmarking process are used to validate the technical questionnaire submitted by the vendors. All machines in the same delivery would be expected to produce similar results so variations between different machines in the same batch are investigated.

CPU benchmarking can also be used to find problems where there is significant difference across a batch, such as incorrect BIOS settings on a particular system.
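
One simple way to spot such differences is to compare every machine with the batch median. The Python sketch below does that; the 5% tolerance, the host names and the scores are all invented for illustration and are not CERN's actual criteria or results.

```python
from statistics import median


def find_outliers(scores, tolerance=0.05):
    """Return hosts whose score deviates from the batch median by more
    than the given relative tolerance."""
    reference = median(scores.values())
    return sorted(
        host for host, score in scores.items()
        if abs(score - reference) / reference > tolerance
    )


# Hypothetical benchmark scores for one delivery.
batch = {"node01": 402.0, "node02": 398.5, "node03": 401.2, "node04": 340.7}
print(find_outliers(batch))  # ['node04']
```

Using the median rather than the mean keeps a single badly configured machine from shifting the reference value it is compared against.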

Disk performance is checked using a reference fio access suite. A minimum performance level in I/O is also required in the tender documents.


Networking interfaces are difficult to burn in compared to disks or CPU. To do a reasonable validation, at least two machines are needed. With batches of 100s of servers, a simple test against a single end point will produce unpredictable results.

Using a network broadcast, the test finds other machines running the stress test; they then pair up and run a number of tests.

  • iperf3 is used for bandwidth, reversed bandwidth, udp and reversed udp
  • iperf for full duplex testing (currently missing from iperf3)
  • ping is used for congestion testing
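
The pairing and the per-pair test list can be sketched in Python as below. This is a guess at the general shape, not CERN's implementation: the discovery step is left out, the helper names are hypothetical, and the exact options CERN passes to each tool (in particular for ping) are not stated in this post.

```python
def pair_hosts(hosts):
    """Pair discovered hosts two by two; an odd host out is returned separately."""
    ordered = sorted(hosts)
    pairs = list(zip(ordered[0::2], ordered[1::2]))
    leftover = ordered[-1] if len(ordered) % 2 else None
    return pairs, leftover


def client_commands(peer):
    """Build the client-side command lines to run against a paired peer."""
    return [
        ["iperf3", "-c", peer],              # TCP bandwidth
        ["iperf3", "-c", peer, "-R"],        # reversed bandwidth
        ["iperf3", "-c", peer, "-u"],        # UDP
        ["iperf3", "-c", peer, "-u", "-R"],  # reversed UDP
        ["iperf", "-c", peer, "-d"],         # iperf2 simultaneous bidirectional test
        ["ping", "-c", "10", peer],          # assumed options; the post gives none
    ]


pairs, leftover = pair_hosts(["node03", "node01", "node02"])
print(pairs, leftover)  # [('node01', 'node02')] node03
```

Each command list can be handed to `subprocess.run` on the client side of a pair, with the peer running the matching `iperf3 -s` or `iperf -s` server.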

Looking forward

CERN is currently deploying Ironic into production for bare metal management of machines. Integrating the burn in and retirement stages into the bare metal management states would bring easy visibility of the current state as the deliveries are processed.

The retirement stage is also of interest to ensure that there is no CERN configuration in the servers (such as Ironic BMC credentials or IP addresses).  CERN has often donated retired servers to other high energy physics sites such as SESAME in Jordan and Morocco which requires a full server factory reset before dismounting. This retirement step would be a more extreme cleaning followed by complete removal from the cloud.

Discussions with other scientific laboratories such as SKA through the OpenStack Scientific special interest group have shown interest in extending Ironic to automate the server on-boarding and retirement processes, as described in the session at the OpenStack Sydney summit. We'll be following up on these discussions at Vancouver.


  • CERN IT department -
  • CERN Ironic and Rework Contributors 
    • Alexandru Grigore
    • Daniel Abad
    • Mateusz Kowalski


by Tim Bell at March 12, 2018 11:29 AM

Lars Kellogg-Stedman

Using Docker macvlan networks

A question that crops up regularly on #docker is "How do I attach a container directly to my local network?" One possible answer to that question is the macvlan network type, which lets you create "clones" of a physical interface on your host and use that to attach containers directly …

by Lars Kellogg-Stedman at March 12, 2018 04:00 AM

March 09, 2018

Julie Pichon

OpenStack PTG Dublin - Rocky

I was so excited when it was first hinted in Denver that the next OpenStack PTG would be in Dublin. In my town! Zero jet lag! Commuting from home! Showing people around! Alas, it was not to be. Thanks, Beast from the East. Now everybody hates Ireland forever.

The weather definitely had some impact on sessions and productivity. People were jokingly then worryingly checking on the news, dropping in and out of rooms as they tried to rebook their cancelled flights. Still we did what we could and had snow-related activities too - good for building team spirit, if nothing else!

I mostly dropped in and out of rooms, here are some of my notes and sometimes highlights.

OpenStack Client

Like before, the first two days of the PTG were focused on cross-project concerns. The OpenStack Client didn't have a room this time, which seems fair as it was sparsely attended the last couple of times - I would have thought there'd have been one helproom session at least but if there was I missed it.

I regret missing the API Working Group morning sessions on API discovery and micro-versions, which I think were relevant. The afternoon API sessions were more focused on services and less applicable for me. I need to be smarter about it next time.

First Contact SIG

Instead, that morning I attended the First Contact Special Interest Group sessions, which aim to make OpenStack more accessible to newcomers. It was well attended, with even a few new and want-to-be contributors who were first-time PTG attendees - I think having the PTG in Europe really helped with that. The session focused on making sure everyone in the room/SIG is aware of the resources that are out there, to be able to help people looking to get started.

The SIG is also looking for points of contact for every project, so that newcomers have someone to ask questions to directly (even better if there's a backup person too, but difficult enough to find one as it is!).

Some of the questions that came up from people in the room related to being able to map projects to IRC channels (e.g. devstack questions go to #openstack-qa).

Also, the OpenStack community has a ton of mentoring programs, both formal and informal and just going through the list to explain them took a while. Outreachy, Google Summer of Code, Upstream Institute, Women of OpenStack, First Contact Liaisons (see above). Didn't realise there were so many!

I remember when a lot of the initiatives discussed were started. It was interesting to hear the perspectives from people who arrived later, especially the discussions around the ones that have become irrelevant.

Packaging RPMs

On Tuesday I dropped by the packaging RPMs Working Group session, a small group made up of very focused RDO/Red Hat/SUSE people. The discussions were intense, with Python 2 going End Of Life in under 2 years now.

The current consensus seems to be to create an RPM-based Python 3 gate based on 3.6. There's no supported distro that offers this at the moment, so we will create our own Fedora-based distro with only what we need at the versions we need it. Once RDO is ready with this, it could be moved upstream.

There were some concerns about 3.5 vs 3.6 as the current gating is done on 3.5. Debian also appears to prefer 3.6. In general it was agreed that there should not be major differences between the two, so this is generally ok.

The clients must still support Python 2.

There was a little bit of discussion about the stable policy and how it doesn't apply to the specs or the rpm-packaging project (I think the example was with Monasca and the default backend not working (?), so a spec change to modify the backend was backported - which could be considered a feature backport, but since the project isn't under the stable policy remit it could be done).

There was a brief chat at the end about whether there is still interest in packaging services, as opposed to just shipping them as containers. There certainly still seems to be at this point.

Release management

A much more complete summary has already been posted on the list, and I had to leave the session halfway to attend something else.

There seems to be an agreement that it is getting easier to upgrade (although some people still don't want to do it, perhaps an education effort is needed to help with this). People do use the stable point release tags.

The "pressure to upgrade": would a Long-Term Support release actually help? Probably it would make it worse. The pressure to upgrade will still be there except there won't be a need to work on it for another year, and it'll make life worse for operators/etc submitting fixes back because it'll take over a year for a patch to make it into their system.

Fast-Forward Upgrade (which is not skip-level upgrades) may help with that pressure... Or not, maybe different problems will come up because of things like not restarting services in between upgrades. It batches things and helps to upgrade faster, but changes nothing.

The conversation moved to one year release cycles just before I left. It seemed to be all concerns and I don't recall active support for the idea. Some of the concerns:

  • Concerns about backports - so many changes
  • Concerns about marketing - it's already hard to keep up with all that's going on, and it's good to show the community is active and stuff is happening more than once a year. It's not that closely tied to releases though, announcements could still go out more regularly.
  • Planning when something will land may become even harder as so much can happen in a year
  • It's painful for both people who keep up and people who don't, because there is so much new stuff happening at once.


The TripleO sessions began with a retrospective on Wednesday. I was really excited to hear that tripleo-common was going to get unit tests for workflows. I still love the idea of workflows but I found them becoming more and more difficult to work with as they get larger, and difficult to review. Boilerplate gets copy-pasted and can't work without a few changes that are easy to miss unless manually tested, and these get missed in reviews all the time.

The next session was about CI. The focus during Queens was on reliability, which worked well although promotions suffered as a result. There were some questions as to whether we should try to prevent people from merging anything when the promotion pipeline is broken but no consensus was really reached.

The Workflows session was really interesting, there's been a lot of Lessons Learnt from our initial attempt with Mistral this last couple of years and it looks like we're setting up for a v2 overhaul that'll get rid of many of the issues we found. Exciting! There was a brief moment of talk about ripping Mistral out and reimplementing everything in Ansible, conclusions unclear.

I didn't take good notes during the other sessions and once the venue closed down (snow!) it became a bit difficult to find people in the hotel and then actually hear them. Most etherpads with the notes are linked from the main TripleO etherpad.

by jpichon at March 09, 2018 11:57 AM

March 07, 2018

Andrew Beekhof

A New Fencing Mechanism (TBD)

Protecting Database Centric Applications

In the same way that some applications require the ability to persist records to disk, for some applications the loss of access to the database means game over - more so than disconnection from the storage.

Cinder-volume is one such application and as it moves towards an active/active model, it is important that a failure in one peer does not represent a SPoF. In the Cinder architecture, the API server has no way to know if the cinder-volume process is fully functional - so a faulty process will still receive new requests to execute.

A cinder-volume process that has lost access to the storage will naturally be unable to complete requests. Worse though is losing access to the database, as this means the result of an action cannot be recorded.

For some operations this is ok, if wasteful, because the operation will fail and be retried. Deletion of something that was already deleted is usually treated as a success and re-attempted operations for creating volume will return a new volume. However performing the same resize operation twice is highly problematic since the recorded old size no longer matches the actual size.

Even the safe operations may never complete because the bad cinder-volume process may end up being asked to perform the cleanup operations from its own failures, which would result in additional failures.

Additionally, despite not being recommended, some Cinder drivers make use of locking. For those drivers it is just as crucial that any locks held by a faulty or hung peer can be recovered within a finite period of time. Hence the need for fencing.

Since power-based fencing is so dependent on node hardware and there is always some kind of storage involved, the idea of leveraging the SBD[1] (Storage Based Death) project’s capabilities to do disk based heartbeating and poison-pills is attractive. When combined with a hardware watchdog, it is an extremely reliable way to ensure safe access to shared resources.

However in Cinder’s case, not all vendors can provide raw access to a small block device on the storage. Additionally, it is really access to the database that needs protecting not the storage. So while useful, it is still relatively easy to construct scenarios that would defeat SBD.

A New Type of Death

Where SBD uses storage APIs to protect applications persisting data to disk, we could also have one based on SQL calls that did the same for Cinder-volume and other database centric applications.

I therefore propose TBD - “Table Based Death” (or “To Be Decided” depending on how you’re wired).

Instead of heartbeating to a designated slot on a block device, the slots become rows in a small table in the database that this new daemon would interact with via SQL.
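To make the idea of "slots as rows" concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the real database. The table name tbd_slots, its columns, and the helper functions are all assumptions made for illustration; no TBD implementation is described in this post beyond the prose.

```python
import sqlite3
import time

# Hypothetical schema: one row ("slot") per peer.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tbd_slots (
    peer        TEXT PRIMARY KEY,
    last_seen   REAL NOT NULL,
    poison_pill INTEGER NOT NULL DEFAULT 0
)
"""


def open_slots(path=":memory:"):
    """Open the database and make sure the slot table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def heartbeat(conn, peer, now=None):
    """Refresh the peer's own slot; create the row on first use."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE tbd_slots SET last_seen = ? WHERE peer = ?", (now, peer)
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO tbd_slots (peer, last_seen) VALUES (?, ?)",
                (peer, now),
            )


def read_slot(conn, peer):
    """Return (last_seen, poison_pill) for a peer, or None if it has no slot."""
    return conn.execute(
        "SELECT last_seen, poison_pill FROM tbd_slots WHERE peer = ?", (peer,)
    ).fetchone()
```

A failed heartbeat() (for example because the database became unreachable or read-only) would raise an exception, which is the signal the daemon would use to stop feeding the watchdog.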

When a peer is connected to the database, a cluster manager like Pacemaker can use a poison pill to fence the peer in the event of a network, node, or resource level failure. Should the peer ever lose quorum or its connection to the database, surviving peers can assume with a degree of confidence that it will self terminate via the watchdog after a known interval.

The desired behaviour can be derived from the following properties:

  1. Quorum is required to write poison pills into a peer’s slot

  2. A peer that finds a poison pill in its slot triggers its watchdog and reboots

  3. A peer that loses connection to the database won’t be able to write status information to its slot which will trigger the watchdog

  4. A peer that loses connection to the database won’t be able to write a poison pill into another peer’s slot

  5. If the underlying database loses too many peers and reverts to read-only, we won’t be able to write to our slot which triggers the watchdog

  6. When a peer loses connection to its peers, the survivors would maintain quorum and write a poison pill to the lost node (1), ensuring the peer will terminate due to scenario (2) or (3)

If N seconds is the worst case time a peer would need to either notice a poison pill or a disconnection from the database and trigger the watchdog, then we can arrange for services to be recovered after some multiple of N has elapsed, in the same way that Pacemaker does for SBD.
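The properties above boil down to two small decisions plus one piece of arithmetic. The sketch below expresses them as pure Python functions; the function names and the default multiple of 2 are assumptions chosen for illustration, since the proposal only says "some multiple of N".

```python
def should_self_fence(db_writable, poison_pill_present):
    """Properties 2, 3 and 5: stop feeding the watchdog if we found a
    poison pill in our slot or could not write our status to it."""
    return poison_pill_present or not db_writable


def may_write_poison_pill(db_writable, has_quorum):
    """Properties 1 and 4: fencing another peer needs both quorum and a
    working database connection."""
    return has_quorum and db_writable


def recovery_delay(n_seconds, multiple=2):
    """Time survivors wait before recovering services from a lost peer.
    `multiple` is a hypothetical safety factor, not a value from the post."""
    return n_seconds * multiple
```

For example, a peer that is isolated from both its cluster peers and the database can neither fence anyone (property 4) nor keep its own slot fresh (property 3), so the survivors only have to wait recovery_delay(N) before taking over its work.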

While TBD would be a valuable addition to a traditional cluster architecture, it is also conceivable that it could be useful in a stand-alone configuration. Consideration should therefore be given during the design phase as to how best to consume membership, quorum, and fencing requests from multiple sources - not just a particular application or cluster manager.


Just as in the SBD architecture, we need TBD to be configured to use the same persistent store (database) as is being consumed by the applications it is protecting. This is crucial as it means the same criteria that enables the application to function, also results in the node self-terminating if it cannot be satisfied.

However for security reasons, the table would ideally live in a different namespace and with different access permissions.

It is also important to note that significant design challenges would need to be faced in order to protect applications managed by the same cluster that was providing the highly available database being consumed by TBD. Consideration would particularly need to be given to the behaviour of TBD and the applications it was protecting during shutdown and cold-start scenarios. Care would need to be taken to avoid unnecessary self-fencing operations and to ensure that failure responses are not impacted by correctly handling these scenarios.


[1] SBD lives under the ClusterLabs banner but can operate without a traditional corosync/pacemaker stack.

by Andrew Beekhof at March 07, 2018 02:11 AM

March 06, 2018

Adam Young

Generating a Callgraph for Keystone

Once I know a starting point for a call, I want to track the other functions that it calls. pycallgraph will generate an image that shows me that.

All this is done inside the virtual env set up by tox at keystone/.tox/py35

I need a stub of a script file in order to run it. I’ll put this in tmp:

from keystone.identity import controllers
from keystone.server import wsgi
from keystone.common import request

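def main():
    # Build a bare request and instantiate the controller so that
    # pycallgraph has a concrete starting point to trace from.
    d = dict()
    r = request.Request(d)

    c = controllers.UserV3()

if __name__ == '__main__':
    main()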

To install pycallgraph:

pip install pycallgraph

And to run it:

 pycallgraph  --max-depth 6  graphviz /tmp/ 

It errors out due to auth issues (it is actually running the code, so don’t do this on a production server).

Here is what it generated.

Not great, but it is a start.
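For readers who want to see the underlying idea without installing anything, here is a rough stdlib-only sketch that records caller/callee pairs with sys.setprofile. It is not pycallgraph's implementation and it produces a plain dictionary rather than an image; the names trace_calls, inner and outer are made up for the example.

```python
import sys
from collections import defaultdict


def trace_calls(func, *args, **kwargs):
    """Run func and return {caller_name: {callee_name, ...}} for every
    Python-level call made while it runs."""
    edges = defaultdict(set)

    def profiler(frame, event, arg):
        # "call" fires for Python functions only; C functions raise
        # "c_call" events, which we ignore here.
        if event == "call" and frame.f_back is not None:
            caller = frame.f_back.f_code.co_name
            callee = frame.f_code.co_name
            edges[caller].add(callee)

    sys.setprofile(profiler)
    try:
        func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always remove the hook again
    return dict(edges)


def inner():
    return 1


def outer():
    return inner() + inner()


graph = trace_calls(outer)
```

As with pycallgraph, the traced code really executes, so the same warning about production servers applies.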

by Adam Young at March 06, 2018 09:53 PM

Inspecting Keystone Routes

What Policy is enforced when you call a Keystone API? Right now, there is no definitive way to say. However, with some programmatic help, we might be able to figure it out from the source code. Let’s start by getting a complete list of the Keystone routes.

In the WSGI framework that Keystone uses, a Route is the object that is used to match the URL. For example, when I try to look at the user with UserId abcd1234, I submit a GET request to the URL https://hostname:port/v3/users/abcd1234. The route path is the pattern /users/{user_id}. The WSGI framework handles the parts of the URL prior to that, and eventually needs to pull out a Python function to execute for the route. Here is how we can generate a list of the route paths in Keystone:

from keystone.server import wsgi
app = wsgi.initialize_admin_application()
composing = app['/v3'].application.application.application.application.application.application._app.application.application.application.application
for route in composing._router.mapper.matchlist:
    print(route.routepath)

I’ll put the output at the end of this post.
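To illustrate how a route path such as /users/{user_id} relates to a concrete URL path, here is a small sketch that turns the pattern into a regular expression. This is only an approximation written for this example: the real mapper supports more than this, and compile_route and match_route are hypothetical helper names, not part of Keystone.

```python
import re


def compile_route(routepath):
    """Turn '/users/{user_id}' into a regex with one named group per
    {placeholder}. Literal parts are used as-is, which is good enough
    for simple paths made of letters, digits, '_' and '/'."""
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", routepath)
    return re.compile("^" + pattern + "$")


def match_route(routepath, url_path):
    """Return the extracted variables, or None if the path does not match."""
    m = compile_route(routepath).match(url_path)
    return m.groupdict() if m else None
```

With this, match_route("/users/{user_id}", "/users/abcd1234") yields the user_id that the matched Python function would receive.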

That long chain of .application properties is due to the way that the pipeline is built using the paste file. In keystone/etc/keystone-paste.ini we see:

# The last item in this pipeline must be service_v3 or an equivalent
# application. It cannot be a filter.
pipeline = healthcheck cors sizelimit http_proxy_to_wsgi osprofiler url_normalize request_id build_auth_context token_auth json_body ec2_extension_v3 s3_extension service_v3

Each of those pipeline elements is a Python class, specified earlier in the file, that honors the middleware contract. Most of these can be traced back to the keystone.common.wsgi.Middleware base class, which implements this contract as a __call__ method.

    def __call__(self, request):
        response = self.process_request(request)
        if response:
            return response
        response = request.get_response(self.application)
        return self.process_response(request, response)

The odd middleware out is AuthContextMiddleware, which extends from keystonemiddleware.auth_token.BaseAuthProtocol. See if you can spot the difference:

    def __call__(self, req):
        """Handle incoming request."""
        response = self.process_request(req)
        if response:
            return response
        response = req.get_response(self._app)
        return self.process_response(response)

Yep: self._app.
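Rather than typing out the chain of attributes by hand, the wrapped application can be found by following whichever of the two attributes is present. The sketch below uses stand-in classes (Middleware, OddMiddleware, Service) invented for this example instead of the real Keystone ones, and it assumes the innermost application has neither an application nor an _app attribute.

```python
class Middleware:
    """Stand-in for a filter that keeps the wrapped app in .application."""

    def __init__(self, application):
        self.application = application


class OddMiddleware:
    """Stand-in for a filter that keeps the wrapped app in ._app instead."""

    def __init__(self, app):
        self._app = app


class Service:
    """Stand-in for the final application at the end of the pipeline."""


def unwrap(app):
    """Follow .application / ._app links until the innermost app is reached."""
    while True:
        for attr in ("application", "_app"):
            inner = getattr(app, attr, None)
            if inner is not None:
                app = inner
                break
        else:
            # Neither attribute found: this is the end of the chain.
            return app


service = Service()
pipeline = Middleware(Middleware(OddMiddleware(Middleware(service))))
innermost = unwrap(pipeline)
```

The nesting mirrors the paste pipeline: the first filter listed is the outermost wrapper, and the last item is the service itself.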

Here is the output from the above code, executed in the python interpreter. This does not have the Verbs in it yet, but a little more poking should show where they are stored:

>>> for route in composing._router.mapper.matchlist:
...     print(route.routepath)

by Adam Young at March 06, 2018 05:17 PM

March 04, 2018

Rich Bowen


I’m heading home from SnowpenStack and it was quite a ride. As Thierry said in our interview at the end of Friday (coming soon to a YouTube channel near you), rather than spoiling things, the freak storm and subsequent closure of the event venue served to create a shared experience and camaraderie that made it even better.

In the end I believe I got 29 interviews, and I’ll hopefully be supplementing this with a dozen online interviews in the coming weeks.

If you missed your interview, or weren’t at the PTG, please contact me and we’ll set something up. And I’ll be in touch with all of the PTLs who were not already represented in one of my interviews.

A huge thank you to everyone that made time to do an interview, and to Erin and Kendall for making everything onsite go so smoothly.

by rbowen at March 04, 2018 07:52 AM

March 02, 2018

OpenStack In Production (CERN)

Expiry of VMs in the CERN cloud

The CERN cloud resources are used for a variety of purposes from running compute intensive workloads to long running services. The cloud also provides personal projects for each user who is registered to the service. This allows a small quota (5 VMs, 10 cores) where the user can have resources dedicated for their use such as boxes for testing. A typical case would be for the CERN IT Tools training where personal projects are used as sandboxes for trying out tools such as Puppet.

Personal projects have a number of differences compared to other projects in the cloud:
  • No non-standard flavors
  • No additional quota can be requested
  • Should not be used for production services
  • VMs are deleted automatically when the person stops being a CERN user
With the number of cloud users increasing to over 3,000, there is a corresponding growth in the number of cores used by personal projects, which grew by 1,200 cores in the past year. In cases like training, users often create VMs and then forget to delete them, so they consume cores which could be used as compute capacity to analyse the data from the LHC.

One possible approach would be to reduce the quota further. However, tests such as setting up a Kubernetes cluster with OpenStack Magnum often need several VMs to perform the different roles so this would limit the usefulness of personal projects. The usage of the full quota is also rare.

VM Expiration

Based on a previous service which offered resources on demand (called CVI based on Microsoft SCVMM), the approach was taken to expire personal virtual machines.
  • Users can create virtual machines up to the limit of their quota
  • Personal VMs are marked with an expiry date
  • Prior to their expiry, the user is sent several mails to inform them their VM will expire soon and how to extend it if it is still useful.
  • On expiry, the virtual machine is locked and shut down. This helps to catch cases where people have forgotten to prolong their VMs.
  • One week later, the virtual machine is deleted, freeing up the resources.
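The lifecycle described in the list above can be sketched as a single decision function. This is only an illustration: the 7-day warning window is an assumed value (the post only says that several mails are sent before expiry), and the action names are invented for the example.

```python
from datetime import datetime, timedelta

WARNING_WINDOW = timedelta(days=7)  # assumption: when reminder mails start
DELETE_AFTER = timedelta(days=7)    # "one week later", as described above


def expiry_action(expire_at, now):
    """Return what should happen to a VM with the given expiry date."""
    if now >= expire_at + DELETE_AFTER:
        return "delete"
    if now >= expire_at:
        return "lock_and_shutdown"
    if now >= expire_at - WARNING_WINDOW:
        return "send_reminder"
    return "none"
```

Checking the latest stage first keeps the function correct even if a run of the workflow was skipped for a few days.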


We use Mistral to automate several OpenStack tasks in the cloud (such as regular snapshots and project creation/deletion). This has the benefit of a clean audit log to show what steps worked/failed along with clear input/output states supporting retries and an authenticated cloud cron for scheduling.

Our OpenStack projects have some properties set when they are created. These are used to indicate additional information, like the accounting codes to be charged for the usage. There are also properties indicating the type of project (such as personal) and whether the expiration workflow should apply. Mistral YAQL code can then select resources where expiration applies.

task(retrieve_all_projects).result.select(dict(id => $.id, name => $.name, enabled => $.enabled, type => $.get('type','none'), expire => $.get('expire','off'))).where($.type='personal').where($.enabled).where($.expire='on')
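For readers less familiar with YAQL, the selection above reads roughly as follows in plain Python. This is a paraphrase for illustration, not code used in the workflow; it assumes each project is a dictionary, and select_expiring_projects is a made-up name.

```python
def select_expiring_projects(projects):
    """Keep enabled personal projects that have expiration switched on."""
    selected = []
    for project in projects:
        summary = {
            "id": project["id"],
            "name": project["name"],
            "enabled": project["enabled"],
            # Missing properties fall back to defaults, as in the YAQL.
            "type": project.get("type", "none"),
            "expire": project.get("expire", "off"),
        }
        if (
            summary["type"] == "personal"
            and summary["enabled"]
            and summary["expire"] == "on"
        ):
            selected.append(summary)
    return selected
```

The defaults matter: a project created without the type or expire properties is treated as type 'none' with expiration 'off', so it is never selected by accident.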

The expire_at parameter is stored as a VM property. This makes it visible to automation, for example through the openstack server show command of the OpenStack client.

There are several parts to the process:
  • A cron-triggered workflow which
    • Ignores machines in error state or currently building
    • Sets the expiration date, according to the grace period, on any newly created machine which does not yet have one
    • Checks whether any machines are getting close to their expiry time and sends a mail to the owner
    • Checks for invalid settings of the expire_at property (such as people setting it a long way in the future or deleting the property) and restores a reasonable value if this is detected
    • Locks and shuts down any machine that has reached its expiry date
    • Deletes any machine that has passed its expiry date by the grace period
  • A workflow, launched by Horizon or from the CLI, which
    • Retrieves the expire_at value and extends it by the prolongation period

    The user notification is done using a set of mail templates and a dedicated workflow. This allows templates such as instance reminders to have details about the resources included, such as the example from the mail template.

    The Virtual Machine {instance} from the project {project_name} in the Cloud Infrastructure Service will expire on {expire_date}.
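The placeholders in the template line above happen to match Python's str.format syntax, so filling one in can be sketched as below. The post does not say which mechanism the workflow actually uses to substitute the values, so treat str.format and the render_reminder helper as assumptions; the sample values are invented.

```python
TEMPLATE = (
    "The Virtual Machine {instance} from the project {project_name} "
    "in the Cloud Infrastructure Service will expire on {expire_date}."
)


def render_reminder(instance, project_name, expire_date):
    """Fill the reminder template with details of one instance."""
    return TEMPLATE.format(
        instance=instance,
        project_name=project_name,
        expire_date=expire_date,
    )


message = render_reminder("vm-01", "Personal jdoe", "2018-04-01")
```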

    A couple of changes to Mistral will be submitted upstream
    • Support for HTML mail bodies which allows us to have a nicer looking e-mail for notification with links included
    • Support for BCC/CC on the mail so that the OpenStack cloud administrator e-mail can also be kept on copy when there are notifications
    A few minor changes to Horizon were also done (currently local patches)
    • Display expire_at value on the instance details page
    • Add a 'prolong' action so that instances can be prolonged via the web by using the properties editor to set the date of the expiry (defaulting to the current date with the expiry time). This launches the workflow for prolonging the instance.
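The prolongation and the sanity check on expire_at can be sketched as two small functions. The post does not give the actual periods, so the 180-day values below are placeholders, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

PROLONGATION = timedelta(days=180)  # placeholder: real period not stated
MAX_LIFETIME = timedelta(days=180)  # placeholder: furthest allowed expiry


def prolong(expire_at):
    """Extend the current expiry date by the prolongation period."""
    return expire_at + PROLONGATION


def sanitize_expire_at(expire_at, now):
    """Restore a reasonable value if the property was deleted (None)
    or set further in the future than allowed."""
    limit = now + MAX_LIFETIME
    if expire_at is None or expire_at > limit:
        return limit
    return expire_at
```

Keeping the check separate from the prolongation means the cron workflow can repair a bad value no matter how it was set, whether through Horizon, the CLI, or a direct property edit.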


    Jose Castro Leon from the CERN cloud team


    by Jose Castro Leon at March 02, 2018 11:46 AM

    RDO Blog

    March 1 Blogpost Roundup

    It’s been a busy few weeks of blogging! Thanks as always to those of you who continue to write great content.

    OpenStack Role Assignment Inheritance for CloudForms by Adam Young

    Operators expect to use CloudForms to perform administrative tasks. For this reason, the documentation for OpenStack states that the Keystone user must have an ‘admin’ role. We found at least one case, however, where this was not sufficient. Fortunately, we have a better approach, and one that can lead to success in a wider array of deployments.


    Listening for connections on all ports/any port by Lars Kellogg-Stedman

    On IRC — and other online communities — it is common to use a “pastebin” service to share snippets of code, logs, and other material, rather than pasting them directly into a conversation. These services will typically return a URL that you can share with others so that they can see the content in their browser.


    Grouping aggregation queries in Gnocchi 4.0.x by Lars Kellogg-Stedman

    In this article, we’re going to ask Gnocchi (the OpenStack telemetry storage service) how much memory was used, on average, over the course of each day by each project in an OpenStack environment.


    TripleO deep dive session #12 (config-download) by Carlos Camacho

    This is the 12th release of the TripleO “Deep Dive” sessions. In this session we will have an update for the TripleO ansible integration called config-download. It’s about applying all the software configuration with Ansible instead of doing it with the Heat agents.


    Maximizing resource utilization with Preemptible Instances by Theodoros Tsioutsias

    The CERN cloud consists of around 8,500 hypervisors providing over 36,000 virtual machines. These provide the compute resources for both the laboratory’s physics program but also for the organisation’s administrative operations such as paying bills and reserving rooms at the hostel.


    Testing TripleO on own OpenStack deployment by mrunge

    For some use cases, it’s quite useful to test TripleO deployments on an OpenStack-powered cloud, rather than using a baremetal system. The following article will show you how to do it.


    A New Thing by Andrew Beekhof

    If you’re interested in Kubernetes and/or managing replicated applications, such as Galera, then you might also be interested in an operator that allows this class of applications to be managed natively by Kubernetes.


    Two Nodes – The Devil is in the Details by Andrew Beekhof

    tl;dr – Many people love 2-node clusters because they seem conceptually simpler and 33% cheaper, but while it’s possible to construct good ones, most will have subtle failure modes.


    by Mary Thengvall at March 02, 2018 01:08 AM

    March 01, 2018

    RDO Blog

    RDO Queens Released

    The RDO community is pleased to announce the general availability of the RDO build for OpenStack Queens for RPM-based distributions, CentOS Linux 7 and Red Hat Enterprise Linux. RDO is suitable for building private, public, and hybrid clouds. Queens is the 17th release from the OpenStack project, which is the work of more than 1600 contributors from around the world.

    RDO team doing the release at the PTG

    The release is making its way out to the CentOS mirror network, and should be on your favorite mirror site momentarily.

    The RDO community project curates, packages, builds, tests and maintains a complete OpenStack component set for RHEL and CentOS Linux and is a member of the CentOS Cloud Infrastructure SIG. The Cloud Infrastructure SIG focuses on delivering a great user experience for CentOS Linux users looking to build and maintain their own on-premise, public or hybrid clouds.

    All work on RDO, and on the downstream release, Red Hat OpenStack Platform, is 100% open source, with all code changes going upstream first.

    New and Improved

    Interesting things in the Queens release include:

    • Ironic now supports Neutron routed networks with flat networking and introduces support for Nova traits when scheduling
    • RDO now includes rsdclient, an OpenStack client plugin for Rack Scale Design architecture
    • Support for octaviaclient and Octavia Horizon plugin has been added to improve Octavia service deployments.
    • Tap-as-a-Service (TaaS) network extension to the OpenStack network service (Neutron) has been included.
    • Multi-vendor Modular Layer 2 (ML2) driver networking-generic-switch is now available for operators deploying RDO Queens.

    Other improvements include:

    • Most of the bundled in-tree tempest plugins have been moved to their own repositories during the Queens cycle. RDO has adapted plugin packages for this new model.
    • In an effort to improve the quality and reduce the delivery time for our users, RDO keeps refining and automating all the processes needed to build, test and publish the packages included in the RDO distribution.

    Note that packages for OpenStack projects with cycle-trailing release models will be created after a release is delivered according to the OpenStack Queens schedule.


    During the Queens cycle, we saw the following new contributors:

    • Aditya Ramteke
    • Jatan Malde
    • Ade Lee
    • James Slagle
    • Alex Schultz
    • Artom Lifshitz
    • Mathieu Bultel
    • Petr Viktorin
    • Radomir Dopieralski
    • Mark Hamzy
    • Sagar Ippalpalli
    • Martin Kopec
    • Victoria Martinez de la Cruz
    • Harald Jensas
    • Kashyap Chamarthy
    • dparalen
    • Thiago da Silva
    • chenxing
    • Johan Guldmyr
    • David J Peacock
    • Sagi Shnaidman
    • Jose Luis Franco Arza

    Welcome to all of you, and thank you so much for participating!

    But, we wouldn’t want to overlook anyone. Thank you to all 76 contributors who participated in producing this release. This list includes commits to rdo-packages and rdo-infra repositories, and is provided in no particular order:

    • Yatin Karel
    • Aditya Ramteke
    • Javier Pena
    • Alfredo Moralejo
    • Christopher Brown
    • Jon Schlueter
    • Chandan Kumar
    • Haikel Guemar
    • Emilien Macchi
    • Jatan Malde
    • Pradeep Kilambi
    • Luigi Toscano
    • Alan Pevec
    • Eric Harney
    • Ben Nemec
    • Matthias Runge
    • Ade Lee
    • Jakub Libosvar
    • Thierry Vignaud
    • Alex Schultz
    • Juan Antonio Osorio Robles
    • Mohammed Naser
    • James Slagle
    • Jason Joyce
    • Artom Lifshitz
    • Lon Hohberger
    • rabi
    • Dmitry Tantsur
    • Oliver Walsh
    • Mathieu Bultel
    • Steve Baker
    • Daniel Mellado
    • Terry Wilson
    • Tom Barron
    • Jiri Stransky
    • Ricardo Noriega
    • Petr Viktorin
    • Juan Antonio Osorio Robles
    • Eduardo Gonzalez
    • Radomir Dopieralski
    • Mark Hamzy
    • Sagar Ippalpalli
    • Martin Kopec
    • Ihar Hrachyshka
    • Tristan Cacqueray
    • Victoria Martinez de la Cruz
    • Bernard Cafarelli
    • Harald Jensas
    • Assaf Muller
    • Kashyap Chamarthy
    • Jeremy Liu
    • Daniel Alvarez
    • Mehdi Abaakouk
    • dparalen
    • Thiago da Silva
    • Brad P. Crochet
    • chenxing
    • Johan Guldmyr
    • Antoni Segura Puimedon
    • David J Peacock
    • Sagi Shnaidman
    • Jose Luis Franco Arza
    • Julie Pichon
    • David Moreau-Simard
    • Wes Hayutin
    • Attila Darazs
    • Gabriele Cerami
    • John Trowbridge
    • Gonéri Le Bouder
    • Ronelle Landy
    • Matt Young
    • Arx Cruz
    • Joe H. Rahme
    • marios
    • Sofer Athlan-Guyot
    • Paul Belanger

    Getting Started

    There are two ways to get started with RDO.

    To spin up a proof of concept cloud, quickly, and on limited hardware, try an All-In-One Packstack installation. You can run RDO on a single node to get a feel for how it works. For a production deployment of RDO, use the TripleO Quickstart and you’ll be running a production cloud in short order.

    Getting Help

    The RDO Project participates in a Q&A service. We also have our own mailing list for RDO-specific users and operators. For more developer-oriented content we recommend joining the development mailing list. Remember to post a brief introduction about yourself and your RDO story. The mailing list archives are all available online. You can also find extensive documentation on the RDO docs site.

    The #rdo channel on Freenode IRC is also an excellent place to find help and give help.

    We also welcome comments and requests on the CentOS mailing lists and the CentOS and TripleO IRC channels (#centos, #centos-devel, and #tripleo on Freenode); however, we have a more focused audience in the RDO venues.

    Getting Involved

    To get involved in the OpenStack RPM packaging effort, see the RDO community pages and the CentOS Cloud SIG page. See also the RDO packaging documentation.

    Join us in #rdo on the Freenode IRC network, and follow us at @RDOCommunity on Twitter. If you prefer Facebook, we’re there too, and also Google+.

    by Rich Bowen at March 01, 2018 10:05 AM

    Rich Bowen

    OpenStack PTG and the Beast From The East

    I’m at the OpenStack PTG in Dublin. I’ve started posting some of my videos on my personal YouTube channel as well as on my work channel.

    It turns out we’ve planned an event in the middle of the storm of the century, which the press is calling the Beast From The East.

    So far it hasn’t amounted to a lot, but there’s a LOT more snow promised for this afternoon, and the government has warned people to stay off the roads after 4 unless they have a really good reason. Which is disappointing because I have a party planned to start at 6. I’m still trying to get hold of the venue to decide what happens next.

    Yesterday I suddenly realized that I had bought my plane ticket for Sunday instead of Saturday by mistake. I quickly booked another hotel room for Saturday night, closer to the airport. Well, it turns out this may have been the most fortunate travel error I’ve made in a long time, as pretty much everything is cancelled for the next few days, and getting out of here on Saturday might have been impossible.

    For now, we’re just watching the weather reports, and hoping for the best.

    by rbowen at March 01, 2018 09:38 AM

    Carlos Camacho

    My 2nd birthday as a Red Hatter

    This post is about my experience working on TripleO as a Red Hatter for the last 2 years. By my 2nd birthday as a Red Hatter, I have learned about many technologies, really a lot… But the most intriguing thing is that here you never stop learning. Not just because you want to learn new things, but because of the project’s nature, this project… TripleO…

    TripleO (OpenStack On OpenStack) is software that aims to deploy OpenStack services using the same OpenStack ecosystem. This means that we deploy a minimal OpenStack instance (Undercloud) and from there, deploy our production environment (Overcloud)… Yikes! What a mouthful, huh? Put simply, TripleO is an installer which should make the lives of integrators, operators and developers easier, but the reality is sometimes far from the expectation.

    TripleO is capable of doing wonderful things. With a little patience, love, and dedication, your hands can be the right hands to deploy complex environments with ease.

    One of the cool things being one of the programmers who write TripleO, from now on TripleOers, is that many of us also use the software regularly. We are writing code not just because we are told to do it, but because we want to improve it for our own purposes.

    Part of the programmers’ motivation momentum have to do with TripleO’s open‐source nature, so if you code in TripleO you are part of a community.

    Congratulations! As a TripleO user or a TripleOer, you are a part of our community and it means that you’re joining a diverse group that spans all age ranges, ethnicities, professional backgrounds, and parts of the globe. We are a passionate bunch of crazy people, proud of this “little” monster and more than willing to help others enjoy using it as much as we do.

    Getting to know the interface (the templates, Mistral, Heat, Ansible, Docker, Puppet, Jinja, …) and how all the components are tied together is probably one of the most daunting aspects of TripleO for newcomers (and not only newcomers). This for sure will raise the blood pressure of some of you who tried using TripleO in the past, but failed miserably and gave up in frustration when it did not behave as expected. Yeah… sometimes that “$h1t” happens…

    Although learning TripleO isn’t that easy, the architecture updates, the decoupling of the role services (“composable roles”), the backup and restore strategies, and the integration of Ansible, among many others, have made great strides toward alleviating that frustration, and the improvements continue through to today.

    So this is the question…

    Is TripleO meant to be “fast to use” or “fast to learn”?

    There are many ways to describe a software product, but first we need to know what the software will be used for… TripleO is designed to work at scale. It might be easier to manually deploy a few controllers and computes, but what about deploying 100 computes, 3 controllers and 50 Cinder nodes, all of them configured to be integrated and work as one single “cloud”? Boom! That is where we find the TripleO benefits: if we want to make it scale, we need to make it fast to use…

    This means that we will find several customizations, hacks, and workarounds to make it work as we need it.

    The upside to this approach is that TripleO evolved to be super-ultra-giga customizable, so operators are enabled to produce great environments blazingly fast.

    The downside, haha, yes… there is a downside (or several). As with most things that are customized, TripleO became somewhat difficult for new people to understand. Also, it’s incredibly hard to test all the possible deployments, and when a user makes non-standard or unsupported customizations, the upgrades are not as intuitive as they need to be…

    This trade‐off is what I mean when I say “fast to use versus fast to learn.” You can be extremely productive with TripleO after you understand how it thinks “yes, it thinks”.

    However, your first few deployments and patches might be arduous. Of course, alleviating that potential pain is what our work is about. IMHO the pros are more than the cons and once you find a niche to improve it will be a really nice experience.

    Also, we have the TripleO YouTube channel, a place to push video tutorials and deep dive sessions driven by the community for the community.

    For the Spanish community we have a 100% translated TripleO UI; help us to reach as many languages as possible!!!

    Anstack was born on July 5th of 2016 (first GitHub commit). Yeah, it is my way of expressing my gratitude to the community, writing some Ctrl+C / Ctrl+V recipes to avoid the frustration of working with TripleO and not having something deployed and easy to use ASAP.

    Anstack does not have much traffic, but the TripleO cheatsheets were at FOSDEM, so in general it is really nice when people reference your writings anywhere. Maybe in the future it can evolve to be more related to ANsible and openSTACK ;) as TripleO is adding more and more support for Ansible.

    What about Red Hat? Yep, I have spent a long time speaking about the project but haven’t spoken about the company making it all real. Red Hat is the world’s leading provider of open source solutions, using a community-powered approach to provide reliable and high-performing cloud, virtualization, storage, Linux, and middleware technologies.

    There is a strong feeling of belonging in Red Hat: you are part of a team and a culture, and you are able to find a perfect balance between your work and life. Also, having people from all over the globe makes it a perfect place for sharing ideas and collaborating. Not all of it is good, e.g. working mostly remotely in upstream communities can be really hard to manage if you are not 100% sure about the tasks that need to be done.

    Keep rocking and become part of the TripleO community!

    by Carlos Camacho at March 01, 2018 12:00 AM

    February 28, 2018

    Adam Young

    OpenStack Role Assignment Inheritance for CloudForms

    Operators expect to use CloudForms to perform administrative tasks. For this reason, the documentation for OpenStack states that the Keystone user must have an ‘admin’ role. We found at least one case, however, where this was not sufficient. Fortunately, we have a better approach, and one that can lead to success in a wider array of deployments.


    CloudForms uses the role assignments for the given user account to enumerate the set of projects. Internally it creates a representation of these projects to be used to track resources. However, the way that ‘admin’ is defined on OpenStack is tied to a single project. This means that CloudForms really has no way to ask “what projects can this user manage?” Now, since admin anywhere is admin everywhere, you would not think that you need to enumerate projects, but it turns out that some of the more complex operations, such as mounting a volume, have to cross service boundaries, and need the project abstraction to link the sets of operations. The CloudForms design did not anticipate this disconnect, and so some of those operations fail.

    Let’s assume, for the moment, that a user had to have a role on a project in order to perform operations on that project. The current admin-everywhere approach would break. What CloudForms would require is an automated way to give a user a role on a project as soon as that project was created. It turns out that CloudForms is not the only thing that has this requirement.

    Role Assignment Inheritance

    Keystone projects do not have to be organized as a flat collection. They can be nested into a tree form. This is called “Hierarchical Multitenancy.” Added to that, a role can be assigned to a user or group on a parent project, and that role assignment is inherited down the tree. This is called “Role Assignment Inheritance.”

    This presentation, while old, does a great job of putting the details together.

    You don’t need to do anything different in your project setup to take advantage of this mechanism. Here’s something that is subtle: a Domain IS A project. Every project is already in a domain, and thus has a parent project. Thus, you can assign a user a role on the domain-as-a-project, and they will have that role on every project inside that domain.

    Sample Code

    Here it is in command-line form.

    openstack role add --user CloudAdmin --user-domain Default --project Default --project-domain Default --inherited admin

    Let’s take those arguments step by step.

    --user CloudAdmin  --user-domain Default

    This is the user that CloudForms is using to connect to Keystone and OpenStack. Every user is owned by a domain, and this user is owned by the “Default” domain.

    --project Default --project-domain Default

    This is the blackest of magic. The Default domain IS-A project. So it owns itself.

    --inherited

    A role assignment is either on a project OR on all its subprojects. So, the user does not actually have a role that is usable against the Default DOMAIN-AS-A-PROJECT, but only on all of the subordinate projects. This might seem strange, but it was built this way for exactly this reason: being able to distinguish between levels of a hierarchy.

    admin

    This is the role name.


    With this role assignment, the CloudForms Management Engine instance can perform all operations on all projects within the default domain. If you add another domain to manage a separate set of projects, you would need to perform this same role assignment on the new domain as well.
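
The inheritance rule described above can be illustrated with a small toy model. This is only a sketch of the semantics as the post describes them, not Keystone code: the `Project` and `Assignment` classes and the `effective_roles` function are hypothetical names invented for the example, and the project names `demo` and `nested` are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Project:
    name: str
    parent: Optional["Project"] = None


@dataclass(frozen=True)
class Assignment:
    user: str
    role: str
    project: str      # name of the project the assignment targets
    inherited: bool   # True: applies to subprojects only, not the target


def effective_roles(user, project, assignments):
    """Return the set of roles `user` can exercise on `project`."""
    roles = set()
    # Direct (non-inherited) assignments apply to the project itself.
    for a in assignments:
        if a.user == user and a.project == project.name and not a.inherited:
            roles.add(a.role)
    # Inherited assignments on any ancestor flow down to this project.
    ancestor = project.parent
    while ancestor is not None:
        for a in assignments:
            if a.user == user and a.project == ancestor.name and a.inherited:
                roles.add(a.role)
        ancestor = ancestor.parent
    return roles


# The Default domain acts as the root project of the tree.
default_domain = Project("Default")
demo = Project("demo", parent=default_domain)
nested = Project("nested", parent=demo)

assignments = [Assignment("CloudAdmin", "admin", "Default", inherited=True)]

print(effective_roles("CloudAdmin", demo, assignments))            # {'admin'}
print(effective_roles("CloudAdmin", nested, assignments))          # {'admin'}
print(effective_roles("CloudAdmin", default_domain, assignments))  # set()
```

The last line shows the subtle point from the text: the inherited assignment grants nothing on the domain-as-a-project itself, only on the projects below it.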

    I assume this is going to leave people with a lot of questions. Please leave comments, and I will try to update this with any major concepts that people want made lucid.

    by Adam Young at February 28, 2018 07:04 PM

    February 27, 2018

    Lars Kellogg-Stedman

    Listening for connections on all ports/any port

    On IRC -- and other online communities -- it is common to use a "pastebin" service to share snippets of code, logs, and other material, rather than pasting them directly into a conversation. These services will typically return a URL that you can share with others so that they can see the …

    by Lars Kellogg-Stedman at February 27, 2018 05:00 AM

    February 26, 2018

    Lars Kellogg-Stedman

    Grouping aggregation queries in Gnocchi 4.0.x

    In this article, we're going to ask Gnocchi (the OpenStack telemetry storage service) how much memory was used, on average, over the course of each day by each project in an OpenStack environment.


    I'm working with an OpenStack "Pike" deployment, which means I have Gnocchi 4.0.x. More …

    by Lars Kellogg-Stedman at February 26, 2018 05:00 AM

    February 24, 2018


    Awesome things in software engineering: open source

    This is part of a blog series highlighting awesome things in software engineering because not everything has to be depressing, about bugs, vulnerabilities, outages or deadlines. If you’d like to collaborate and write about awesome things in software engineering too, let’s chat: reach out on Twitter or LinkedIn. What’s this blog series about ? Between you and me, software engineering isn’t always fun. You’re not always working on what you like.

    February 24, 2018 12:00 AM

    February 23, 2018

    Julie Pichon

    Migrated to Pelican

    After 8 years of maintaining my lil' custom Django blog, it's time for a change! I'd been thinking about migrating for a while. After the first couple of years of excitement I started falling further and further behind framework upgrades, and my cute anti-spam system kicked the bucket a couple of years back, even though there never was much conversation on the blog. Drop me an email or a tweet if you want to chat about something here :)

    I'd been postponing the migration because I thought it would be real painful to migrate both the content and keep the URL format the same, especially for a custom platform. It turned out to be really easy. Pelican rocks!

    Migrating the content

    Pelican comes with an import tool that supports bland little feeds like mine. By default my feed only displays 10 entries but since it's my code I just modified it locally to show them all. That probably ended up being one of the least straightforward parts of the process actually. I was super excited about Django when I first created the blog but not too familiar with how to manage Python dependencies. Thus, although I did write down the dependency names in a text file I wasn't forward looking enough to include version numbers. pip freeze is my friend now. Thankfully I only had a couple of plugins to play guess-what-version at.

    I did end up making a couple of changes to Pelican locally so it would work better with my content (yay open-source).

    First, to avoid the <pre> code snippets getting mangled with no linebreaks I ended up commenting out a few lines in fields2pelican() that look like they're meant to ensure the validity of the original HTML. I was using a wizard in the old blog so there's no reason it shouldn't be. I wasn't too worried about it and didn't notice side-effects during the migration.

    Secondly, the files weren't created with the correct slugs and filenames, which caused some issues when rewriting the URLs. It looks like the feed parser doesn't look at the real slug so I figured out where the URL was at in feed2fields() (in for me) and changed the slug = slugify(entry.title) line to break down that value and extract the real slug.
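
As a rough idea of what "break down that value and extract the real slug" could look like, here is a minimal sketch. It is not the actual change made to Pelican; `slug_from_link` is a hypothetical helper, and the `/blog/<year>/<month>/<slug>/` URL layout is an assumption based on the rewrite patterns used later in this post.

```python
from urllib.parse import urlparse


def slug_from_link(link):
    """Return the last non-empty path segment of a feed entry URL."""
    path = urlparse(link).path
    segments = [s for s in path.split("/") if s]
    if not segments:
        raise ValueError(f"no slug found in {link!r}")
    return segments[-1]


print(slug_from_link("http://localhost:8000/blog/2018/02/migrated-to-pelican/"))
# migrated-to-pelican
```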

    Adjusting the content

    Now, I use tags quite liberally and on the feed that was marked with "Tagged with: blah, bleh, bloh" at the end of an article. I wrote a short script to scrape that line from the rst files created in the previous step, add the discovered tags to :tags: in the metadata and remove the 'Tagged with' line. That was fun! The script is ugly and bugs were found along the way, but it did the job and now it even works when there are so many tags on an entry that they're spread over several lines ;)
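
The original script isn't shown, so the following is only a sketch of the same idea under a few assumptions: the 'Tagged with' line sits at the end of the article, any non-blank lines directly after it are continuation lines of the tag list, and the file already has a block of `:field:` metadata lines near the top. The function name `move_tags_to_metadata` is made up for the example.

```python
import re

TAG_LINE = re.compile(r"^Tagged with:\s*(.*)$")


def move_tags_to_metadata(rst_text):
    """Move a trailing 'Tagged with: a, b' line into a ':tags:' field."""
    tags = []
    kept = []
    collecting = False
    for line in rst_text.splitlines():
        match = TAG_LINE.match(line.strip())
        if match:
            collecting = True
            chunk = match.group(1)
        elif collecting and line.strip():
            # Tag list wrapped onto the next line.
            chunk = line.strip()
        else:
            collecting = False
            kept.append(line)
            continue
        tags.extend(t.strip() for t in chunk.split(",") if t.strip())
    if not tags:
        return rst_text
    # Drop blank lines left behind at the end of the article.
    while kept and not kept[-1].strip():
        kept.pop()
    # Find the end of the first run of ':field:' metadata lines.
    insert_at = None
    for i, line in enumerate(kept):
        if line.startswith(":"):
            insert_at = i + 1
        elif insert_at is not None:
            break
    if insert_at is None:
        raise ValueError("no metadata block found")
    kept.insert(insert_at, ":tags: " + ", ".join(tags))
    return "\n".join(kept) + "\n"


sample = """Migrated to Pelican
###################

:date: 2018-02-23 19:12
:author: jpichon

Sure feels nice.

Tagged with: blog, pelican,
python
"""

print(move_tags_to_metadata(sample))
```

Running it on the sample moves the three tags into a `:tags: blog, pelican, python` line right after `:author:` and removes the 'Tagged with' lines from the body.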

    Rewriting the URLs

    I don't know if I should even give this a heading. Figuring out rewrite rules was giving me cold sweats but it turns out Pelican gives you handy settings out of the box to have your URLs look like whatever you want. It's really easy. I mean, I don't think I broke anything?!

    Except the feeds, but after some thinking that's something I decided to do on purpose. The blog has ended up aggregated in a lot of places I don't even remember, and I was really concerned about 8 years of entries somehow getting newer timestamps and flooding the planets I'm on. So, brand new feeds. I'll update the two or three planets I remember being a part of, and the others as I find them or they find me again :)

    Going mad with sed

    After putting what I had so far on a temporary place online, a couple of additional issues popped up:

    • When the feed was imported, some of the internal URLs were copied as full URLs rather than relative ones. That means there were a bunch of references to http://localhost:8000, since I'd used a local copy of the feed.
    • The theme, images and most of the links didn't work because they expected the site to start at / but I was working off a temporary sub-directory for the test version.

    I've never used sed so much in my life. I'm going to be an expert at it for the next three days at least, until I forget it all again. Here, writing some of them down now for future-me when how to use groups becomes a distant memory:

    # Fix the images!
    $ for f in `grep -rl "image:: http:\/\/localhost:8000" *`; do  sed -i 's/image:: http:\/\/localhost:8000/image:: {filename}/g' "$f"; done
    # Fix the internal links!
    $ sed -i 's/<\/blog\/[0-9]*\/[0-9]*/<{filename}\/Tech/g' content/Tech/*
    $ sed -i 's/\({filename.*\)\/>`__/\1.rst>`__/g' content/Tech/*
    # Fix the tags!
    $ for f in `grep -rl /tag/ *`; do  sed -i 's/\/tag\/\(.*\)\//{tag}\1/g' "$f"; done

    I think I had to do a bunch of other ad-hoc modifications. I also expect to find more niggles which I'll fix as I see them, but for now I'm happy with the current shape of things. I can't overstate how much easier this was than I expected. The stuff that took the most time (remembering how to run the custom blog code locally, importing tags, sedding all the things) was nearly all self-inflicted, and the whole process was over in a couple of evenings.

    Blogging from emacs

    Sure feels nice.

    by jpichon at February 23, 2018 07:12 PM


    Rebranding Ansible Run Analysis to ARA Records Ansible

    So I got an idea recently… Let’s rebrand Ansible Run Analysis to ARA Records Ansible. If you’d like to review and comment on the code change, you can do so here: Why? I watched the last season of Silicon Valley recently. The series, while exaggerated, provides a humorous look at the world of startups. I don’t have any plans on creating a startup but I love that it makes you think about things like needing a clever name or how you would do a proper “elevator” pitch to get funding.

    February 23, 2018 12:00 AM

    Carlos Camacho

    TripleO deep dive session #12 (config-download)

    This is the 12th release of the TripleO “Deep Dive” sessions.

    Thanks to James Slagle for this new session, in which he describes a feature called config-download.

    In this session we get an update on the TripleO Ansible integration called config-download. It’s about applying all the software configuration with Ansible instead of doing it with the Heat agents.

    So please, check the full session content on the TripleO YouTube channel.

    Please check the sessions index to have access to all available content.

    by Carlos Camacho at February 23, 2018 12:00 AM

    February 21, 2018

    OpenStack In Production (CERN)

    Maximizing resource utilization with Preemptible Instances


    The CERN cloud consists of around 8,500 hypervisors providing over 36,000 virtual machines. These provide the compute resources for both the laboratory's physics program and the organisation's administrative operations, such as paying bills and reserving rooms at the hostel.

    The resources themselves are generally ordered once or twice a year, with servers being kept for around 5 years. Within the CERN budget, the resource planning team looks at:
    • The resources required to meet the computing service requirements of the CERN laboratory. These are projected using capacity planning trend data and upcoming projects such as video conferencing.

    With the installation and commissioning of thousands of servers concurrently (along with their associated decommissioning 5 years later), there are scenarios to exploit underutilised servers. Programs such as LHC@Home are used, but we have also been interested in expanding the cloud to provide virtual machine instances which can be rapidly terminated in the event of:
    • Resources being required for IT services as they scale out for events such as a large scale web cast on a popular topic or to provision instances for a new version of an application.
    • Partially full hypervisors where the last remaining cores are not being requested (the Tetris problem).
    • Compute servers at the end of their lifetime which are used to the full before being removed from the computer centre to make room for new deliveries which are more efficient and in warranty.

    The characteristic of this workload is that it should be possible to stop an instance within a short time (a few minutes) compared to a traditional physics job.

    Resource Management In Openstack

    Operators use project quotas for ensuring the fair sharing of their infrastructure. The problem with this is that quotas act as hard limits. This leads to actually dedicating resources for workloads even if they are not used all the time, or to situations where resources are not available even though there is quota still to use.

    At the same time, the demand for cloud resources is increasing rapidly. Since
    there is no cloud with infinite capabilities, operators need a way to optimize
    the resource utilization before proceeding to the expansion of their infrastructure.

    Resources in idle state can occur, showing lower cloud utilization than the full
    potential of the acquired equipment while the users’ requirements are growing.

    The concept of Preemptible Instances can be the solution to this problem. This type of server can be spawned on top of the project's quota, making use of the underutilised capabilities. When the resources are requested by tasks with higher priority (such as approved quota), the preemptible instances are terminated to make space for the new VM.

    Preemptible Instances with Openstack

    Supporting preemptible instances would mirror the AWS Spot Market and the Google Preemptible Instances. There are multiple things to be addressed here as part of an implementation with OpenStack, but the most important can be reduced to these:
    1. Tagging Servers as Preemptible
    In order to be able to distinguish between preemptible and non-preemptible servers, there is the need to tag the instances at creation time. This property should be immutable for the lifetime of the servers.
    2. Who gets to use preemptible instances
    There is also the need to limit which user/project is allowed to use preemptible instances. An operator should be able to choose which users are allowed to spawn this type of VM.
    3. Selecting servers to be terminated
    Considering that the preemptible instances can be scattered across the different cells/availability zones/aggregates, there has to be “someone” able to find the existing instances, decide the way to free up the requested resources according to the operator’s needs and, finally, terminate the appropriate VMs.
    4. Quota on top of project’s quota
    In order to avoid possible misuse, there could be a way to control the amount of preemptible resources that each user/project can use. This means that apart from the quota for the standard resource classes, there could be a way to enforce quotas on the preemptible resources too.

    OPIE : IFCA and Indigo Dataclouds

    In 2014, the first investigations into approaches were made by Alvaro Lopez from IFCA. As part of the EU Indigo Datacloud project, this led to the development of the OpenStack Pre-Emptible Instances package. This was written up in a paper in Journal of Physics: Conference Series and presented at the OpenStack summit.

    Prototype Reaper Service

    At the OpenStack Forum during a recent OpenStack summit, a detailed discussion took place on how spot instances could be implemented without significant changes to Nova. The ideas were then followed up with the OpenStack Scientific Special Interest Group.

    Trying to address the different aspects of the problem, we are currently prototyping a “Reaper” service. This service acts as an orchestrator for preemptible instances. Its sole purpose is to decide the way to free up the preemptible resources when they are requested for another task.
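
To make the "decide the way to free up resources" step more concrete, here is one possible selection strategy as a sketch. It is not the prototype's algorithm, which is not described in this post; the `Instance` class, the `select_victims` function and the "largest first" greedy rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    vcpus: int
    ram_mb: int
    preemptible: bool


def select_victims(instances, free_vcpus, free_ram_mb, need_vcpus, need_ram_mb):
    """Pick preemptible instances to terminate so the request fits.

    Returns a list of victims (possibly empty), or None if the request
    cannot fit even after terminating every preemptible instance.
    """
    victims = []
    # Greedy choice (an assumption): terminate the largest preemptible
    # instances first so that as few VMs as possible are affected.
    candidates = sorted(
        (i for i in instances if i.preemptible),
        key=lambda i: (i.vcpus, i.ram_mb),
        reverse=True,
    )
    for inst in candidates:
        if free_vcpus >= need_vcpus and free_ram_mb >= need_ram_mb:
            break
        victims.append(inst)
        free_vcpus += inst.vcpus
        free_ram_mb += inst.ram_mb
    if free_vcpus < need_vcpus or free_ram_mb < need_ram_mb:
        return None
    return victims


host = [
    Instance("a", vcpus=4, ram_mb=8192, preemptible=True),
    Instance("b", vcpus=2, ram_mb=4096, preemptible=True),
    Instance("c", vcpus=8, ram_mb=16384, preemptible=False),
]

chosen = select_victims(host, free_vcpus=1, free_ram_mb=2048,
                        need_vcpus=4, need_ram_mb=8192)
print([i.name for i in chosen])  # ['a']
```

A real service would also have to weigh the operator's policy (for example instance age or project priority) and look across hosts, cells and aggregates, as noted in the list of requirements.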

    The reason for implementing this prototype is mainly to help us identify possible changes that are needed in the Nova codebase to support Preemptible Instances.
    More on this WIP can be found here: 


    The concept of Preemptible Instances gives operators the ability to provide a more "elastic" capacity. At the same time, it enables the handling of increased demand for resources, with the same infrastructure, by maximizing the cloud utilization.

    This type of server is perfect for tasks/apps that can be terminated at any time, enabling the users to take advantage of extra CPU power on demand without the fixed limits that quotas enforce.

    Finally, here at CERN, there is an ongoing effort to provide a prototype
    orchestrator for Preemptible Servers with Openstack, in order to pinpoint the
    changes needed in Nova to support this feature optimally. This could also be
    available in future for other OpenStack clouds in use by CERN such as the
    T-Systems Open Telekom Cloud through the Helix Nebula Open Science Cloud


    • Theodoros Tsioutsias (CERN openlab fellow working on Huawei collaboration)
    • Spyridon Trigazis (CERN)
    • Belmiro Moreira (CERN)


    by Theodoros Tsioutsias at February 21, 2018 12:37 PM

    February 16, 2018

    Matthias Runge

    Testing TripleO on own OpenStack deployment

    For some use cases, it's quite useful to test TripleO deployments on an OpenStack-powered cloud, rather than using a baremetal system. The following article will show you how to do it:

    We're going to use tripleo-quickstart . This also assumes, you have downloaded your OpenStack handy and stored …

    by mrunge at February 16, 2018 08:30 AM

    Andrew Beekhof

    A New Thing

    I made a new thing.

    If you’re interested in Kubernetes and/or managing replicated applications, such as Galera, then you might also be interested in an operator that allows this class of applications to be managed natively by Kubernetes.

    There is plenty to read on why the operator exists, how replication is managed and the steps to install it if you’re interested in trying it out.

    There is also a screencast that demonstrates the major concepts:


    Feedback welcome.

    by Andrew Beekhof at February 16, 2018 03:32 AM

    February 15, 2018

    Andrew Beekhof

    Two Nodes - The Devil is in the Details

    tl;dr - Many people love 2-node clusters because they seem conceptually simpler and 33% cheaper, but while it’s possible to construct good ones, most will have subtle failure modes

    The first step towards creating any HA system is to look for and try to eliminate single points of failure, often abbreviated as SPoF.

    It is impossible to eliminate all risk of downtime, and especially when one considers the additional complexity that comes with introducing additional redundancy, concentrating on single (rather than chains of related and therefore decreasingly probable) points of failure is widely accepted as a suitable compromise.

    The natural starting point then is to have more than one node. However, before the system can move services to the surviving node after a failure, in general, it needs to be sure that they are not still active elsewhere.

    So not only are we looking for SPoFs, but we are also looking to balance risks and consequences, and the calculus will be different for every deployment [1].

    There is no downside if a failure causes both members of a two node cluster to serve up the same static website. However, it’s a very different story if it results in both sides independently managing a shared job queue or providing uncoordinated write access to a replicated database or shared filesystem.

    So in order to prevent a single node failure from corrupting your data or blocking recovery, we rely on something called fencing.


    At its heart, fencing turns a question, “Can our peer cause data corruption?”, into an answer, “no”, by isolating it both from incoming requests and persistent storage. The most common approach to fencing is to power off failed nodes.

    There are two categories of fencing which I will call direct and indirect but could equally be called active and passive. Direct methods involve action on the part of surviving peers, such as interacting with an IPMI or iLO device, whereas indirect methods rely on the failed node to somehow recognise it is in an unhealthy state (or is at least preventing remaining members from recovering) and signal a hardware watchdog to panic the machine.

    Quorum helps in both these scenarios.

    Direct Fencing

    In the case of direct fencing, we can use it to prevent fencing races when the network fails. By including the concept of quorum, there is enough information in the system (even without connectivity to their peers) for nodes to automatically know whether they should initiate fencing and/or recovery.

    Without quorum, both sides of a network split will rightly assume the other is dead and rush to fence the other. In the worst case, both sides succeed, leaving the entire cluster offline. The next worst is a death match, a never-ending cycle of nodes coming up, not seeing their peers, rebooting them and initiating recovery only to be rebooted when their peer goes through the same logic.

    The problem with fencing is that the most commonly used devices become inaccessible due to the same failure events we want to use them to recover from. Most IPMI and iLO cards both lose power with the hosts they control and by default use the same network that is causing the peers to believe the others are offline.

    Sadly the intricacies of IPMI and iLO devices are rarely a consideration at the point hardware is being purchased.

    Indirect Fencing

    Quorum is also crucial for driving indirect fencing and, when done right, can allow survivors to safely assume that missing nodes have entered a safe state after a defined period of time.

    In such a setup, the watchdog’s timer is reset every N seconds unless quorum is lost. If the timer (usually some multiple of N) expires, then the machine performs an ungraceful power off (not shutdown).
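
The watchdog behaviour can be sketched as a small simulation. This is only an illustration of the timing rule described above, using abstract ticks rather than real seconds; the function name `simulate_watchdog` and its parameters are invented for the example and do not correspond to any cluster software's API.

```python
def simulate_watchdog(quorum_by_tick, reset_interval, timeout):
    """Return the tick at which the watchdog fires, or None if it never does.

    quorum_by_tick[t] says whether the node had quorum at tick t.
    The timer is reset every `reset_interval` ticks, but only while the
    node has quorum. If `timeout` ticks pass without a reset, the machine
    is powered off ungracefully.
    """
    last_reset = 0
    for tick, has_quorum in enumerate(quorum_by_tick):
        if tick - last_reset >= timeout:
            return tick          # timer expired: hard power-off
        if tick % reset_interval == 0 and has_quorum:
            last_reset = tick    # reset the watchdog timer
    return None


# Quorum is held for ticks 0-4 and then lost for good.
history = [True] * 5 + [False] * 15
print(simulate_watchdog(history, reset_interval=2, timeout=6))      # 10
print(simulate_watchdog([True] * 20, reset_interval=2, timeout=6))  # None
```

In the first run the last reset happens at tick 4, so the node powers itself off at tick 10. A survivor that knows the timeout can wait that long and then treat the missing node as safely down.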

    This is very effective but without quorum to drive it, there is insufficient information from within the cluster to determine the difference between a network outage and the failure of your peer. The reason this matters is that without a way to differentiate between the two cases, you are forced to choose a single behaviour mode for both.

    The problem with choosing a single response is that there is no course of action that both maximises availability and prevents corruption.

    • If you choose to assume the peer is alive but it actually failed, then the cluster has unnecessarily stopped services.

    • If you choose to assume the peer is dead but it was just a network outage, then the best case scenario is that you have signed up for some manual reconciliation of the resulting datasets.

    No matter what heuristics you use, it is trivial to construct a single failure that either leaves both sides running or where the cluster unnecessarily shuts down the surviving peer(s). Taking quorum away really does deprive the cluster of one of the most powerful tools in its arsenal.

    Given no other alternative, the best approach is normally to sacrifice availability. Making corrupted data highly available does no-one any good and manually reconciling divergent datasets is no fun either.


    Quorum sounds great, right?

    The only drawback is that in order to have it in a cluster with N members, you need to be able to see N/2 + 1 of your peers. Which is impossible in a two node cluster after one node has failed.
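
The arithmetic is easy to check with a few lines of code. This sketch assumes integer division for N/2 and counts the node itself among the visible members; `has_quorum` is a name chosen for the example.

```python
def has_quorum(visible_members, cluster_size):
    """True if `visible_members` (including this node) form a majority."""
    return visible_members >= cluster_size // 2 + 1


for size in (2, 3, 5):
    needed = size // 2 + 1
    tolerated_failures = size - needed
    print(size, needed, tolerated_failures)
# 2 2 0
# 3 2 1
# 5 3 2

print(has_quorum(1, 2))  # False: the lone survivor of a two node cluster
print(has_quorum(2, 3))  # True: two survivors out of three
```

A two node cluster needs both nodes for a majority, so it tolerates zero failures, while adding a third node raises that to one.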

    Which finally brings us to the fundamental issue with two-nodes:

    quorum does not make sense in two node clusters, and

    without it there is no way to reliably determine a course of action that both maximises availability and prevents corruption

    Even in a system of two nodes connected by a crossover cable, there is no way to conclusively differentiate between a network outage and a failure of the other node. Unplugging one end (whose likelihood is surely proportional to the distance between the nodes) would be enough to invalidate any assumption that link health equals peer node health.

    Making Two Nodes Work

    Sometimes the client can’t or won’t make the additional purchase of a third node and we need to look for alternatives.

    Option 1 - Add a Backup Fencing Method

    A node’s iLO or IPMI device represents a SPoF because, by definition, if it fails the survivors cannot use it to put the node into a safe state. In a cluster of 3 nodes or more, we can mitigate this with a quorum calculation and a hardware watchdog (an indirect fencing mechanism as previously discussed). In a two node case we must instead use network power switches (aka. power distribution units or PDUs).

    After a failure, the survivor first attempts to contact the primary (the built-in iLO or IPMI) fencing device. If that succeeds, recovery proceeds as normal. Only if the iLO/IPMI device fails is the PDU invoked and assuming it succeeds, recovery can again continue.
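
The fallback order described above can be sketched as a simple loop over fencing devices. This is an illustration only: the `fence` function and the two stand-in device callables are hypothetical, and real cluster software talks to IPMI and PDU hardware through its own fence agents.

```python
def fence(node, devices):
    """Try each fencing device in order; return the name of the one that worked.

    `devices` is an ordered list of (name, callable) pairs. Each callable
    takes the node name and returns True on success.
    Raises RuntimeError if every device fails, in which case recovery
    must not proceed.
    """
    for name, power_off in devices:
        try:
            if power_off(node):
                return name
        except OSError:
            # An unreachable device counts as a failed attempt.
            pass
    raise RuntimeError(f"unable to fence {node}; recovery is blocked")


def ipmi_unreachable(node):
    # Stand-in for a primary device that died with its host.
    raise ConnectionError("IPMI lost power together with its host")


def pdu_ok(node):
    # Stand-in for a backup PDU that cuts power successfully.
    return True


print(fence("node2", [("ipmi", ipmi_unreachable), ("pdu", pdu_ok)]))  # pdu
```

The important property is the final `RuntimeError`: if no device confirms the power-off, the survivor cannot assume its peer is safe and must not start recovering services.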

    Be sure to place the PDU on a different network to the cluster traffic, otherwise a single network failure will prevent access to both fencing devices and block service recovery.

    You might be wondering at this point… doesn’t the PDU represent a single point of failure? To which the answer is “definitely”.

    If that risk concerns you, and you would not be alone, connect both peers to two PDUs and tell your cluster software to use both when powering peers on and off. Now the cluster remains active if one PDU dies, and would require a second fencing failure of either the other PDU or an IPMI device in order to block recovery.

    Option 2 - Add an Arbitrator

    In some scenarios, although a backup fencing method would be technically possible, it is politically challenging. Many companies like to have a degree of separation between the admin and application folks, and security-conscious network admins are not always enthusiastic about handing over the usernames and passwords to the PDUs.

    In this case, the recommended alternative is to create a neutral third-party that can supplement the quorum calculation.

    In the event of a failure, a node needs to be able to see either its peer or the arbitrator in order to recover services. The arbitrator also acts as a tie-breaker if both nodes can see the arbitrator but not each other.

    This option needs to be paired with an indirect fencing method, such as a watchdog that is configured to panic the machine if it loses connection to its peer and the arbitrator. In this way, the survivor is able to assume with reasonable confidence that its peer will be in a safe state after the watchdog expiry interval.
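
    The decision each node makes can be sketched as a small table. This is only an illustration: the function and the action names are made up, and what the loser of the tie-break does ("stop-services" here) is an assumption rather than something stated above.

```shell
# decide SEES_PEER SEES_ARBITRATOR WINS_TIEBREAK  (each "yes" or "no")
# Prints the action a node takes under the arbitrator scheme.
decide() {
    sees_peer=$1 sees_arbitrator=$2 wins_tiebreak=$3
    if [ "$sees_peer" = yes ]; then
        echo "run-normally"
    elif [ "$sees_arbitrator" = no ]; then
        # Isolated from both peer and arbitrator: the watchdog panics the machine.
        echo "self-fence"
    elif [ "$wins_tiebreak" = yes ]; then
        # Safe to take over once the peer's watchdog interval has expired.
        echo "recover-services"
    else
        # Assumption: the node that loses the tie-break gives up its services.
        echo "stop-services"
    fi
}

decide yes yes no   # healthy cluster
decide no  no  no   # isolated node
decide no  yes yes  # peer unreachable, arbitrator picks this node
decide no  yes no   # peer unreachable, arbitrator picks the other node
```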

    The practical difference between an arbitrator and a third node is that the arbitrator has a much lower footprint and can act as a tie-breaker for more than one cluster.

    Option 3 - More Human Than Human

    The final approach is for survivors to continue hosting whatever services they were already running, but not start any new ones until either the problem resolves itself (network heals, node reboots) or a human takes on the responsibility of manually confirming that the other side is dead.

    Bonus Option

    Did I already mention you could add a third node? We test those a lot :-)

    Two Racks

    For the sake of argument, let’s imagine I’ve convinced you, the reader, of the merits of a third node. We must now consider the physical arrangement of the nodes. If they are placed in (and obtain power from) the same rack, that too represents a SPoF, and one that cannot be resolved by adding a second rack.

    If this is surprising, consider what happens when the rack with two nodes fails and how the surviving node would differentiate between this case and a network failure.

    The short answer is that it can’t and we’re back to having all the problems of the two-node case. Either the survivor:

    • ignores quorum and incorrectly tries to initiate recovery during network outages (whether fencing is able to complete is a different story and depends on whether PDUs are involved and whether they share power with any of the racks), or

    • respects quorum and unnecessarily shuts itself down when its peer fails

    Either way, two racks is no better than one and the nodes must either be given independent supplies of power or be distributed across three (or more depending on how many nodes you have) racks.
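
    The arithmetic behind this is easy to check. The sketch below (helper names are made up for illustration) treats quorum as a strict majority of all nodes and reports what the remaining nodes see after each rack fails in turn:

```shell
# has_quorum VISIBLE TOTAL: succeeds when VISIBLE is a strict majority of TOTAL.
has_quorum() {
    [ $(( $1 * 2 )) -gt "$2" ]
}

# report NODES_PER_RACK...: for each rack (given as a node count), print
# whether the remaining nodes still hold quorum after that rack fails.
report() {
    total=0
    for n in "$@"; do
        total=$((total + n))
    done
    for n in "$@"; do
        left=$((total - n))
        if has_quorum "$left" "$total"; then
            verdict="quorum kept"
        else
            verdict="quorum lost"
        fi
        echo "lose rack of $n: $left/$total nodes left, $verdict"
    done
}

report 2 1      # three nodes in two racks
report 1 1 1    # three nodes in three racks
```

    With two racks there is always one rack whose loss takes quorum with it; with one node per rack across three racks, any single rack can fail and the remaining two nodes still form a majority.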

    Two Datacenters

    By this point the more risk averse readers might be thinking about disaster recovery. What happens when an asteroid hits the one datacenter with our three nodes distributed across three different racks? Obviously Bad Things(tm) but depending on your needs, adding a second datacenter might not be enough.

    Done properly, a second datacenter gives you a (reasonably) up-to-date and consistent copy of your services and their data. However, just like the two-node and two-rack scenarios, there is not enough information in the system to both maximise availability and prevent corruption (or diverging datasets). Even with three nodes (or racks), distributing them across only two datacenters leaves the system unable to reliably make the correct decision in the (now far more likely) event that the two sides cannot communicate.

    Which is not to say that a two-datacenter solution is never appropriate. It is not uncommon for companies to want a human in the loop before taking the extraordinary step of failing over to a backup datacenter. Just be aware that if you want automated failover, you’re either going to need a third datacenter in order for quorum to make sense (either directly or via an arbitrator) or find a way to reliably power fence an entire datacenter.


    [1] Not everyone needs redundant power companies with independent transmission lines, although the paranoia paid off for at least one customer when their monitoring detected a failing transformer. The customer was on the phone trying to warn the power company when it finally blew.

    by Andrew Beekhof at February 15, 2018 11:52 PM

    February 13, 2018

    RDO Blog

    Feb 13 Community Blogpost Roundup

    Here’s the latest edition of the community blog round-up. Thanks for your contributions!

    Deleting an image on RDO by Adam Young

    So I uploaded a qcow image… but did it wrong. It was tagged as raw instead of qcow, and now I want it gone. Only problem… it is stuck.


    Keystonerc for RDO cloud by Adam Young

    If you are using RDO Cloud and want to do command line Ops, here is the outline of a keystone.rc file you can use to get started.


    Debugging TripleO revisited – Heat, Ansible & Puppet by Steve Hardy

    Some time ago I wrote a post about debugging TripleO heat templates, which contained some details of possible debug workflows when TripleO deployments fail. In recent releases we’ve made some major changes to the TripleO architecture. In this post I’d like to provide a refreshed tutorial on typical debug workflow, primarily focusing on the configuration phase of a typical TripleO deployment, and with particular focus on interfaces which have changed or are new since my original debugging post.


    Listing iptables rules with line numbers by Lars Kellogg-Stedman

    You can list iptables rules with rule numbers using the --line-numbers option, but this only works in list (-L) mode. I find it much more convenient to view rules using the output from iptables -S or iptables-save.


    FOSDEM ’18, and the CentOS Brussels Dojo by Rich Bowen

    The first weekend in February always finds me in Brussels for FOSDEM and the various associated events, and this year is no exception.


    Matching Create and Teardown in an Ansible Role by Adam Young

    Nothing lasts forever. Except some developer setups that no-one seems to know who owns, and no one is willing to tear down. I’ve tried to build the code to clean up after myself into my provisioning systems. One pattern I’ve noticed is that the same data is required for building and for cleaning up a cluster. When I built Ossipee, each task had both a create and a teardown stage. I want the same from Ansible. Here is how I’ve made it work thus far.


    Deploying an image on OpenStack that is bigger than the available flavors. by Adam Young

    Today I tried to use our local OpenStack instance to deploy CloudForms Management Engine (CFME). Our OpenStack deployment has a set of flavors that all are defined with 20 GB Disks. The CFME image is larger than this, and will not deploy on the set of flavors. Here is how I worked around it.


    Freeing up a Volume from a Nova server that errored by Adam Young

    Trial and error. It’s a key part of getting work done in my field, and I make my share of errors. Today, I tried to create a virtual machine in Nova using a bad glance image that I had converted to a bootable volume:


    by Mary Thengvall at February 13, 2018 06:03 PM

    February 10, 2018

    Adam Young

    Deleting an image on RDO

    So I uploaded a qcow image…but did it wrong. It was tagged as raw instead of qcow, and now I want it gone. Only problem….it is stuck.

    $ openstack image delete rhel-server-7.4-update-4-x86_64
    Failed to delete image with name or ID 'rhel-server-7.4-update-4-x86_64': 409 Conflict
    Image 2e77971e-7746-4992-8e1e-7ce1be8528f8 could not be deleted because it is in use: The image cannot be deleted because it is in use through the backend store outside of Glance.

    But….I deleted all of the instances connected to it! Come On!

    Answer is easy once the code-rage wears off…

    When I created a server based on this image, it created a new volume. That volume is locking the image into place.

    $ openstack volume list
    | ID                                   | Name | Status    | Size | Attached to                      |
    | 97a15e9c-2744-4f31-95f3-a13603e49b6d |      | error     |    1 |                                  |
    | c9337612-8317-425f-b313-f8ba9336f1cc |      | available |    1 |                                  |
    | 9560a18f-bfeb-4964-9785-6e76fa720892 |      | in-use    |    9 | Attached to showoff on /dev/vda  |
    | 0188edd7-7e91-4a80-a764-50d47bba9978 |      | in-use    |    9 | Attached to test1 on /dev/vda    |

    See that error? I think it’s that one. I can’t confirm now, as I also deleted the available one, as I didn’t need it, either.

    $ openstack volume delete 97a15e9c-2744-4f31-95f3-a13603e49b6d
    $ openstack volume delete c9337612-8317-425f-b313-f8ba9336f1cc
    $ openstack image delete rhel-server-7.4-update-4-x86_64

    And that last command succeeded.

    $ openstack image show  rhel-server-7.4-update-4-x86_64
    Could not find resource rhel-server-7.4-update-4-x86_64
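
    To pick out the candidate volumes without reading the table by eye, a small filter like the one below can help. It is only a sketch: the helper name is made up, and it assumes the table layout shown above (ID in the first column, Status in the third).

```shell
# volumes_with_status STATUS: read an "openstack volume list" table on stdin
# and print the IDs of volumes whose Status column equals STATUS.
volumes_with_status() {
    awk -F'|' -v want="$1" '
        NF >= 5 {
            id = $2
            status = $4
            gsub(/ /, "", id)
            gsub(/ /, "", status)
            if (status == want) print id
        }'
}

# Demo on two rows copied from the listing above.
printf '%s\n' \
    '| ID                                   | Name | Status    | Size | Attached to |' \
    '| 97a15e9c-2744-4f31-95f3-a13603e49b6d |      | error     |    1 |             |' \
    '| 9560a18f-bfeb-4964-9785-6e76fa720892 |      | in-use    |    9 | Attached to showoff on /dev/vda |' \
    | volumes_with_status error
```

    On a real cloud the input would come from piping openstack volume list into the function.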

    by Adam Young at February 10, 2018 12:06 AM

    February 09, 2018

    Adam Young

    Keystonerc for RDO cloud

    If you are using RDO Cloud and want to do command line Ops, here is the outline of a keystone.rc file you can use to get started.

    unset $( set | awk -F= '/^OS_/ {print $1}' )
    export OS_AUTH_URL=
    export OS_USERNAME={username}
    export OS_PASSWORD={password}
    export OS_USER_DOMAIN_NAME=Default
    export OS_PROJECT_DOMAIN_NAME=Default
    export OS_PROJECT_NAME={projectname}
    export OS_IDENTITY_API_VERSION=3

    You might have been given a different AUTH URL to use. The important parts are appending the /v3/ and explicitly setting the OS_IDENTITY_API_VERSION=3. Setting both is overkill, but you can never have too much overkill.

    Once you have this set, source it, and you can run:

    $ openstack image list
    | ID                                   | Name                                      | Status |
    | af47a290-3af3-4e46-bb56-4f250a3c20a4 | CentOS-6-x86_64-GenericCloud-1706         | active |
    | b5446129-8c75-4ce7-84a3-83756e5f1236 | CentOS-7-x86_64-GenericCloud-1701         | active |
    | 8f41e8ce-cacc-4354-a481-9b9dba4f6de7 | CentOS-7-x86_64-GenericCloud-1703         | active |
    | 42a43956-a445-47e5-89d0-593b9c7b07d0 | CentOS-7-x86_64-GenericCloud-1706         | active |
    | ffff3320-1bf8-4a9a-a26d-5abd639a6e33 | CentOS-7-x86_64-GenericCloud-1708         | active |
    | 28b76dd3-4017-4b46-8dc9-98ef1cb4034f | CentOS-7-x86_64-GenericCloud-1801-01      | active |
    | 2e596086-38c9-41d1-b1bd-bcf6c3ddbdef | CentOS-Atomic-Host-7.1706-GenericCloud    | active |
    | 1dfd12d7-6f3a-46a6-ac69-03cf870cd7be | CentOS-Atomic-Host-7.1708-GenericCloud    | active |
    | 31e9cf36-ba64-4b27-b5fc-941a94703767 | CentOS-Atomic-Host-7.1801-02-GenericCloud | active |
    | c59224e2-c5df-4a86-b7b6-49556d8c7f5c | bmc-base                                  | active |
    | 5dede8d3-a723-4744-97df-0e6ca93f5460 | ipxe-boot                                 | active |
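
    If the command fails instead of listing images, it is worth checking that the file actually set everything. Here is a minimal sketch of such a check; the function name is made up, and keystone.example.com is a placeholder, not the RDO Cloud endpoint.

```shell
# check_openstack_env: report which of the variables from the rc file are
# missing, and whether OS_AUTH_URL ends in /v3 as recommended above.
check_openstack_env() {
    missing=""
    for name in OS_AUTH_URL OS_USERNAME OS_PASSWORD OS_PROJECT_NAME \
                OS_USER_DOMAIN_NAME OS_PROJECT_DOMAIN_NAME; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    case $OS_AUTH_URL in
        */v3|*/v3/) echo "ok" ;;
        *) echo "OS_AUTH_URL does not end in /v3/"; return 1 ;;
    esac
}

# Demo in a subshell with made-up values.
(
    OS_AUTH_URL=https://keystone.example.com:5000/v3/
    OS_USERNAME=demo OS_PASSWORD=secret OS_PROJECT_NAME=demo
    OS_USER_DOMAIN_NAME=Default OS_PROJECT_DOMAIN_NAME=Default
    check_openstack_env
)
```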

    by Adam Young at February 09, 2018 10:17 PM

    Steve Hardy

    Debugging TripleO revisited - Heat, Ansible & Puppet

    Some time ago I wrote a post about debugging TripleO heat templates, which contained some details of possible debug workflows when TripleO deployments fail.

    In recent releases (since the Pike release) we've made some major changes to the TripleO architecture - we make more use of Ansible "under the hood", and we now support deploying containerized environments.  I described some of these architectural changes in a talk at the recent OpenStack Summit in Sydney.

    In this post I'd like to provide a refreshed tutorial on typical debug workflow, primarily focussing on the configuration phase of a typical TripleO deployment, and with particular focus on interfaces which have changed or are new since my original debugging post.

    We'll start by looking at the deploy workflow as a whole and some heat interfaces for diagnosing the nature of the failure, then we'll look at how to debug directly via Ansible and Puppet.  In a future post I'll also cover the basics of debugging containerized deployments.

    The TripleO deploy workflow, overview

    A typical TripleO deployment consists of several discrete phases, which are run in order:

    Provisioning of the nodes

    1. A "plan" is created (heat templates and other files are uploaded to Swift running on the undercloud)
    2. Some validation checks are performed by Mistral/Heat then a Heat stack create is started (by Mistral on the undercloud)
    3. Heat creates some groups of nodes (one group per TripleO role e.g "Controller"), which results in API calls to Nova
    4. Nova makes scheduling/placement decisions based on your flavors (which can be different per role), and calls Ironic to provision the baremetal nodes
    5. The nodes are provisioned by Ironic

    This first phase is the provisioning workflow. It is complete once the nodes are reported ACTIVE by nova (i.e. the nodes are provisioned with an OS and running).

    Host preparation

    The next step is to configure the nodes in preparation for starting the services, which again has a specific workflow (some optional steps are omitted for clarity):

    1. The node networking is configured, via the os-net-config tool
    2. We write hieradata for puppet to the node filesystem (under /etc/puppet/hieradata/*)
    3. We write some data files to the node filesystem (a puppet manifest for baremetal configuration, and some json files that are used for container configuration)

    Service deployment, step-by-step configuration

    The final step is to deploy the services, either on the baremetal host or in containers. This consists of several tasks run in a specific order:

    1. We run puppet on the baremetal host (even in the containerized architecture this is still needed, e.g to configure the docker daemon and a few other things)
    2. We run "" to generate the configuration files for each enabled service (this only happens once, on step 1, for all services)
    3. We start any containers enabled for this step via the "paunch" tool, which translates some json files into running docker containers, and optionally does some bootstrapping tasks.
    4. We run again (with a different configuration, only on one node the "bootstrap host"), this does some bootstrap tasks that are performed via puppet, such as creating keystone users and endpoints after starting the service.

    Note that these steps are performed repeatedly with an incrementing step value (e.g step 1, 2, 3, 4, and 5), with the exception of the "" config generation which we only need to do once (we just generate the configs for all services regardless of which step they get started in).
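
    The ordering can be summarised as a simple loop. This is purely illustrative (the function only prints what would happen at each point, and none of it is TripleO code), but it shows which tasks repeat per step and which run once:

```shell
# Print the order of tasks across the five deployment steps described above.
deploy_steps() {
    for step in 1 2 3 4 5; do
        echo "step $step: puppet host configuration"
        if [ "$step" -eq 1 ]; then
            # Config generation happens once, for all services.
            echo "step $step: generate service config files"
        fi
        echo "step $step: start containers enabled for this step"
        echo "step $step: bootstrap tasks on the bootstrap host"
    done
}

deploy_steps
```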

    Below is a diagram which illustrates this step-by-step deployment workflow:
    [Diagram: TripleO Service configuration workflow]

    The most common deployment failures occur during this service configuration phase of deployment, so the remainder of this post will primarily focus on debugging failures of the deployment steps.


    Debugging first steps - what failed?

    Heat Stack create failed.

    Ok something failed during your TripleO deployment, it happens to all of us sometimes!  The next step is to understand the root-cause.

    My starting point after this is always to run:

    openstack stack failures list --long <stackname>

    (undercloud) [stack@undercloud ~]$ openstack stack failures list --long overcloud
    resource_type: OS::Heat::StructuredDeployment
    physical_resource_id: 421c7860-dd7d-47bd-9e12-de0008a4c106
    status: CREATE_FAILED
    status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
    deploy_stdout: |

    PLAY [localhost] ***************************************************************


    TASK [Run puppet host configuration for step 1] ********************************
    ok: [localhost]

    TASK [debug] *******************************************************************
    fatal: [localhost]: FAILED! => {
    "changed": false,
    "failed_when_result": true,
    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
    "Debug: Runtime environment: puppet_version=4.8.2, ruby_version=2.0.0, run_mode=user, default_encoding=UTF-8",
    "Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain"
    to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/8dd0b23a-acb8-4e11-aef7-12ea1d4cf038_playbook.retry

    PLAY RECAP *********************************************************************
    localhost : ok=18 changed=12 unreachable=0 failed=1

    We can tell several things from the output (which has been edited above for brevity). Firstly, from the name of the failing resource:

    • The error was on one of the Controllers (ControllerDeployment)
    • The deployment failed during the per-step service configuration phase (the AllNodesDeploySteps part tells us this)
    • The failure was during the first step (Step1.0)
    Then we see more clues in the deploy_stdout: ansible failed running the task which runs puppet on the host, and it looks like a problem with the puppet code.

    With a little more digging we can see which node exactly this failure relates to, e.g we copy the SoftwareDeployment ID from the output above, then run:

    (undercloud) [stack@undercloud ~]$ openstack software deployment show 421c7860-dd7d-47bd-9e12-de0008a4c106 --format value --column server_id
    (undercloud) [stack@undercloud ~]$ openstack server list | grep 29b3c254-5270-42ae-8150-9fc3f67d3d89
    | 29b3c254-5270-42ae-8150-9fc3f67d3d89 | overcloud-controller-0 | ACTIVE | ctlplane= | overcloud-full | oooq_control |

    Ok so puppet failed while running via ansible on overcloud-controller-0.
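
    The two lookups can be chained into one helper. This is a sketch under the assumption that the same openstack client used above is available on the undercloud; the function name is made up.

```shell
# node_for_deployment DEPLOYMENT_ID: print the name of the server that a
# failed software deployment was targeting.
node_for_deployment() {
    server_id=$(openstack software deployment show "$1" \
        --format value --column server_id) || return 1
    openstack server show "$server_id" --format value --column name
}

# Example (run on the undercloud):
#   node_for_deployment 421c7860-dd7d-47bd-9e12-de0008a4c106
```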


    Debugging via Ansible directly

    Having identified that the problem was during the ansible-driven configuration phase, one option is to re-run the same configuration directly via ansible-playbook, so you can either increase verbosity or potentially modify the tasks to debug the problem.

    Since the Queens release, this is actually very easy, using a combination of the new "openstack overcloud config download" command and the tripleo dynamic ansible inventory.

    (undercloud) [stack@undercloud ~]$ openstack overcloud config download
    The TripleO configuration has been successfully generated into: /home/stack/tripleo-VOVet0-config
    (undercloud) [stack@undercloud ~]$ cd /home/stack/tripleo-VOVet0-config
    (undercloud) [stack@undercloud tripleo-VOVet0-config]$ ls
    common_deploy_steps_tasks.yaml external_post_deploy_steps_tasks.yaml templates
    Compute global_vars.yaml update_steps_playbook.yaml
    Controller group_vars update_steps_tasks.yaml
    deploy_steps_playbook.yaml post_upgrade_steps_playbook.yaml upgrade_steps_playbook.yaml
    external_deploy_steps_tasks.yaml post_upgrade_steps_tasks.yaml upgrade_steps_tasks.yaml

    Here we can see there is a "deploy_steps_playbook.yaml", which is the entry point to run the ansible service configuration steps.  This runs all the common deployment tasks (as outlined above) as well as any service specific tasks (these end up in task include files in the per-role directories, e.g Controller and Compute in this example).

    We can run the playbook again on all nodes with the tripleo-ansible-inventory from tripleo-validations, which is installed by default on the undercloud:

    (undercloud) [stack@undercloud tripleo-VOVet0-config]$ ansible-playbook -i /usr/bin/tripleo-ansible-inventory deploy_steps_playbook.yaml --limit overcloud-controller-0
    TASK [Run puppet host configuration for step 1] ********************************************************************
    ok: []

    TASK [debug] *******************************************************************************************************
    fatal: []: FAILED! => {
    "changed": false,
    "failed_when_result": true,
    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
    "Notice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend",
    "exception: connect failed",
    "Warning: Undefined variable '::deploy_config_name'; ",
    " (file & line not available)",
    "Warning: Undefined variable 'deploy_config_name'; ",
    "Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain"


    NO MORE HOSTS LEFT *************************************************************************************************
    to retry, use: --limit @/home/stack/tripleo-VOVet0-config/deploy_steps_playbook.retry

    PLAY RECAP ********************************************************************************************************* : ok=56 changed=2 unreachable=0 failed=1

    Here we can see the same error is reproduced directly via ansible, and we made use of the --limit option to only run tasks on the overcloud-controller-0 node.  We could also have added --tags to limit the tasks further (see tripleo-heat-templates for which tags are supported).

    If the error were ansible related, this would be a good way to debug and test any potential fixes to the ansible tasks, and in the upcoming Rocky release there are plans to switch to this model of deployment by default.


    Debugging via Puppet directly

    Since this error seems to be puppet related, the next step is to reproduce it on the host (obviously the steps above often yield enough information to identify the puppet error, but this assumes you need to do more detailed debugging directly via puppet):

    Firstly we log on to the node, and look at the files in the /var/lib/tripleo-config directory.

    (undercloud) [stack@undercloud tripleo-VOVet0-config]$ ssh heat-admin@
    Warning: Permanently added '' (ECDSA) to the list of known hosts.
    Last login: Fri Feb 9 14:30:02 2018 from gateway
    [heat-admin@overcloud-controller-0 ~]$ cd /var/lib/tripleo-config/
    [heat-admin@overcloud-controller-0 tripleo-config]$ ls
    docker-container-startup-config-step_1.json docker-container-startup-config-step_4.json puppet_step_config.pp
    docker-container-startup-config-step_2.json docker-container-startup-config-step_5.json
    docker-container-startup-config-step_3.json docker-container-startup-config-step_6.json

    The puppet_step_config.pp file is the manifest applied by ansible on the baremetal host.

    We can debug any puppet host configuration by running puppet apply manually. Note that hiera is used to control the step value; this will be at the same value as the failing step, but it can sometimes be useful to modify it manually for development testing of different steps for a particular service.

    [root@overcloud-controller-0 tripleo-config]# hiera -c /etc/puppet/hiera.yaml step
    [root@overcloud-controller-0 tripleo-config]# cat /etc/puppet/hieradata/config_step.json
    {"step": 1}
    [root@overcloud-controller-0 tripleo-config]# puppet apply --debug puppet_step_config.pp
    Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain

    Here we can see the problem is a typo in the /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp file at line 181. I look at the file, fix the problem (ugeas should be augeas), then re-run puppet apply to confirm the fix.
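
    For the case mentioned earlier where you want to test a different step, a helper along these lines could rewrite the step file. It is a sketch, not part of TripleO: the function name is made up, it assumes the file holds nothing but the step value (as the cat output above suggests), and it keeps a .bak copy so the original value can be restored.

```shell
# set_puppet_step STEP [FILE]: write {"step": STEP} to the hiera data file,
# keeping a backup of the previous contents in FILE.bak.
set_puppet_step() {
    step=$1
    file=${2:-/etc/puppet/hieradata/config_step.json}
    case $step in
        ''|*[!0-9]*) echo "step must be a number" >&2; return 1 ;;
    esac
    if [ -f "$file" ]; then
        cp "$file" "$file.bak"    # keep the value the deployment wrote
    fi
    printf '{"step": %s}\n' "$step" > "$file"
}

# Demo against a temporary file rather than the real hieradata.
demo=$(mktemp)
set_puppet_step 2 "$demo"
cat "$demo"
rm -f "$demo" "$demo.bak"
```

    On a node you would run it as root with no second argument, then re-run puppet apply as shown above.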

    Note that with puppet module fixes you will need to get the fix either into an updated overcloud image, or update the module via deploy artifacts for testing local forks of the modules.

    That's all for today, but in a future post, I will cover the new container architecture, and share some debugging approaches I have found helpful when deployment failures are container related.

    by Steve Hardy at February 09, 2018 05:04 PM

    RDO Blog

    Coming soon: Rocky PTG in Dublin

    In just a few weeks, the OpenStack Foundation will be holding the PTG – the Project Teams Gathering – in Dublin, Ireland. At this event, the various project teams will discuss what will be implemented in the Rocky release of OpenStack.

    This is the third PTG, with the first one being in Atlanta, and the second in Denver. At each PTG, I’ve done video interviews of the various project teams, about what they accomplished in the just-completed cycle, and what they intend to do in the next.

    The videos from Atlanta are HERE.

    And the videos from Denver are HERE.

    If you’ll be at this PTG, please consider doing an interview with your project team. You can sign up in the Google doc. And please take a moment to review the information about what kind of questions I’ll be asking. I’ll be interviewing Tuesday through Friday. I’ll know the specific location where I’ll be set up once I’m on-site on Monday.

    by Rich Bowen at February 09, 2018 06:51 AM

    February 08, 2018

    RDO Blog

    FOSDEM 2018

    Last weekend was FOSDEM, the annual Free and Open Source software convention in Brussels. The OpenStack Foundation had a table at the event, where there were lots of opportunities to talk with people either using OpenStack, or learning about it for the first time.

    There was a good crowd that came by the table, and we had great conversations with many users.

    The table was staffed entirely by volunteers from the OpenStack developer community, representing several different organizations.

    On the day before FOSDEM, the CentOS community held their usual pre-FOSDEM Dojo, and a few members of the RDO community took part. You can see the videos from that event on the CentOS YouTube channel.

    by Rich Bowen at February 08, 2018 08:51 PM

    Lars Kellogg-Stedman

    Listing iptables rules with line numbers

    You can list iptables rules with rule numbers using the --line-numbers option, but this only works in list (-L) mode. I find it much more convenient to view rules using the output from iptables -S or iptables-save.

    You can augment the output from these commands with rule numbers with the …
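
    One possible way to do that, offered as a sketch rather than the method from the full post, is a small awk filter that counts the -A lines per chain (the function name is made up):

```shell
# number_rules: read "iptables -S" or "iptables-save" output on stdin and
# prefix every rule with its chain name and position within that chain.
number_rules() {
    awk '
        /^\*/ { split("", count) }    # new table in iptables-save output: restart numbering
        $1 == "-A" {
            count[$2]++
            printf "[%s %d] %s\n", $2, count[$2], $0
            next
        }
        { print }
    '
}

# Demo on a few sample rules; on a real host use: iptables -S | number_rules
printf '%s\n' \
    '-P INPUT ACCEPT' \
    '-A INPUT -i lo -j ACCEPT' \
    '-A INPUT -p tcp --dport 22 -j ACCEPT' \
    '-A FORWARD -j DROP' \
    | number_rules
```

    Policy lines and table headers pass through unchanged; only appended rules get a number.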

    by Lars Kellogg-Stedman at February 08, 2018 05:00 AM