aboutsummaryrefslogtreecommitdiff
path: root/block/qcow2.c
AgeCommit message (Collapse)AuthorFilesLines
2017-05-11qcow2: Discard/zero clusters by byte countEric Blake1-13/+9
Passing a byte offset, but sector count, when we ultimately want to operate on cluster granularity, is madness. Clean up the external interfaces to take both offset and count as bytes, while still keeping the assertion added previously that the caller must align the values to a cluster. Then rename things to make sure backports don't get confused by changed units: instead of qcow2_discard_clusters() and qcow2_zero_clusters(), we now have qcow2_cluster_discard() and qcow2_cluster_zeroize(). The internal functions still operate on clusters at a time, and return an int for number of cleared clusters; but on an image with 2M clusters, a single L2 table holds 256k entries that each represent a 2M cluster, totalling well over INT_MAX bytes if we ever had a request for that many bytes at once. All our callers currently limit themselves to 32-bit bytes (and therefore fewer clusters), but by making this function 64-bit clean, we have one less place to clean up if we later improve the block layer to support 64-bit bytes through all operations (with the block layer auto-fragmenting on behalf of more-limited drivers), rather than the current state where some interfaces are artificially limited to INT_MAX at a time. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20170507000552.20847-13-eblake@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2017-05-11qcow2: Optimize write zero of unaligned tail clusterEric Blake1-0/+7
We've already improved discards to operate efficiently on the tail of an unaligned qcow2 image; it's time to make a similar improvement to write zeroes. The special case is only valid at the tail cluster of a file, where we must recognize that any sectors beyond the image end would implicitly read as zero, and therefore should not penalize our logic for widening a partial cluster into writing the whole cluster as zero. However, note that for now, the special case of end-of-file is only recognized if there is no backing file, or if the backing file has the same length; that's because when the backing file is shorter than the active layer, we don't have code in place to recognize that reads of a sector unallocated at the top and beyond the backing end-of-file are implicitly zero. It's not much of a real loss, because most people don't use images that aren't cluster-aligned, or where the active layer is a different size than the backing layer (especially where the difference falls within a single cluster). Update test 154 to cover the new scenarios, using two images of intentionally differing length. While at it, fix the test to gracefully skip when run as ./check -qcow2 -o compat=0.10 154 since the older format lacks zero clusters already required earlier in the test. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Message-id: 20170507000552.20847-11-eblake@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2017-05-11qcow2: Make distinction between zero cluster types obviousEric Blake1-3/+6
Treat plain zero clusters differently from allocated ones, so that we can simplify the logic of checking whether an offset is present. Do this by splitting QCOW2_CLUSTER_ZERO into two new enums, QCOW2_CLUSTER_ZERO_PLAIN and QCOW2_CLUSTER_ZERO_ALLOC. I tried to arrange the enum so that we could use 'ret <= QCOW2_CLUSTER_ZERO_PLAIN' for all unallocated types, and 'ret >= QCOW2_CLUSTER_ZERO_ALLOC' for allocated types, although I didn't actually end up taking advantage of the layout. In many cases, this leads to simpler code, by properly combining cases (sometimes, both zero types pair together, other times, plain zero is more like unallocated while allocated zero is more like normal). Signed-off-by: Eric Blake <eblake@redhat.com> Message-id: 20170507000552.20847-7-eblake@redhat.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2017-05-11qcow2: Fix preallocation size formulaMax Reitz1-4/+5
When calculating the number of reftable entries, we should actually use the number of refblocks and not (wrongly[1]) re-calculate it. [1] "Wrongly" means: Dividing the number of clusters by the number of entries per refblock and rounding down instead of up. Reported-by: Eric Blake <eblake@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2017-05-09qobject: Use simpler QDict/QList scalar insertion macrosEric Blake1-2/+2
We now have macros in place to make it less verbose to add a scalar to QDict and QList, so use them. Patch created mechanically via: spatch --sp-file scripts/coccinelle/qobject.cocci \ --macro-file scripts/cocci-macro-file.h --dir . --in-place then touched up manually to fix a couple of '?:' back to original spacing, as well as avoiding a long line in monitor.c. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20170427215821.19397-7-eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Markus Armbruster <armbru@redhat.com>
2017-04-28qcow2: Allow discard of final unaligned clusterEric Blake1-1/+6
As mentioned in commit 0c1bd46, we ignored requests to discard the trailing cluster of an unaligned image. While discard is an advisory operation from the guest standpoint, (and we are therefore free to ignore any request), our qcow2 implementation exploits the fact that a discarded cluster reads back as 0. As long as we discard on cluster boundaries, we are fine; but that means we could observe non-zero data leaked at the tail of an unaligned image. Enhance iotest 66 to cover this case, and fix the implementation to honor a discard request on the final partial cluster. Signed-off-by: Eric Blake <eblake@redhat.com> Message-id: 20170407013709.18440-1-eblake@redhat.com Signed-off-by: Max Reitz <mreitz@redhat.com>
2017-04-28block: Add .bdrv_truncate() error messagesMax Reitz1-0/+2
Add missing error messages for the block driver implementations of .bdrv_truncate(); drop the generic one from block.c's bdrv_truncate(). Since one of these changes touches a mis-indented block in block/file-posix.c, this patch fixes that coding style issue along the way. Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20170328205129.15138-5-mreitz@redhat.com Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2017-04-28block: Add errp to BD.bdrv_truncate()Max Reitz1-4/+4
Add an Error parameter to the block drivers' bdrv_truncate() interface. If a block driver does not set this in case of an error, the generic bdrv_truncate() implementation will do so. Where it is obvious, this patch also makes some block drivers set this value. Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20170328205129.15138-4-mreitz@redhat.com Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2017-04-28block: Add errp to b{lk,drv}_truncate()Max Reitz1-5/+9
For one thing, this allows us to drop the error message generation from qemu-img.c and blockdev.c and instead have it unified in bdrv_truncate(). Signed-off-by: Max Reitz <mreitz@redhat.com> Message-id: 20170328205129.15138-3-mreitz@redhat.com Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2017-02-28block: Add BDRV_O_RESIZE for blk_new_open()Kevin Wolf1-2/+4
blk_new_open() is a convenience function that processes flags rather than QDict options as a simple way to just open an image file. In order to keep it convenient in the future, it must automatically request the necessary permissions. This can easily be inferred from the flags for read and write, but we need another flag that tells us whether to get the resize permission. We can't just always request it because that means that no block jobs can run on the resulting BlockBackend (which is something that e.g. qemu-img commit wants to do), but we also can't request it never because most of the .bdrv_create() implementations call blk_truncate(). The solution is to introduce another flag that is passed by all users that want to resize the image. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Fam Zheng <famz@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-28block: Add error parameter to blk_insert_bs()Kevin Wolf1-2/+8
Now that blk_insert_bs() requests the BlockBackend permissions for the node it attaches to, it can fail. Instead of aborting, pass the errors to the callers. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-28block: Add permissions to blk_new()Kevin Wolf1-1/+1
We want every user to be specific about the permissions it needs, so we'll pass the initial permissions as parameters to blk_new(). A user only needs to call blk_set_perm() if it wants to change the permissions after the fact. The permissions are stored in the BlockBackend and applied whenever a BlockDriverState should be attached in blk_insert_bs(). This does not include actually choosing the right set of permissions everywhere yet. Instead, the usual FIXME comment is added to each place and will be addressed in individual patches. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-28block: Request child permissions in format driversKevin Wolf1-0/+1
This makes use of the .bdrv_child_perm() implementation for formats that we just added. All format drivers expose the permissions they actually need nows, so that they can be set accordingly and updated when parents are attached or detached. The only format not included here is raw, which was already converted with the other filter drivers. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Fam Zheng <famz@redhat.com>
2017-02-24block: Attach bs->file only during .bdrv_open()Kevin Wolf1-3/+15
The way that attaching bs->file worked was a bit unusual in that it was the only child that would be attached to a node which is not opened yet. Because of this, the block layer couldn't know yet which permissions the driver would eventually need. This patch moves the point where bs->file is attached to the beginning of the individual .bdrv_open() implementations, so drivers already know what they are going to do with the child. This is also more consistent with how driver-specific children work. For a moment, bdrv_open() gets its own BdrvChild to perform image probing, but instead of directly assigning this BdrvChild to the BDS, it becomes a temporary one and the node name is passed as an option to the drivers, so that they can simply use bdrv_open_child() to create another reference for their own use. This duplicated child for (the not opened yet) bs is not the final state, a follow-up patch will change the image probing code to use a BlockBackend, which is completely independent of bs. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-24block: Pass BdrvChild to bdrv_truncate()Kevin Wolf1-2/+2
Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-24qcow2: Use BB for resizing in qcow2_amend_options()Kevin Wolf1-1/+5
In order to able to convert bdrv_truncate() to take a BdrvChild and later to correctly check the resize permission here, we need to use a BlockBackend for resizing the image. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com>
2017-02-12qcow2: Optimize the refcount-block overlap checkAlberto Garcia1-0/+1
The metadata overlap checks introduced in a40f1c2add help detect corruption in the qcow2 image by verifying that data writes don't overlap with existing metadata sections. The 'refcount-block' check in particular iterates over the refcount table in order to get the addresses of all refcount blocks and check that none of them overlap with the region where we want to write. The problem with the refcount table is that since it always occupies complete clusters its size is usually very big. With the default values of cluster_size=64KB and refcount_bits=16 this table holds 8192 entries, each one of them enough to map 2GB worth of host clusters. So unless we're using images with several TB of allocated data this table is going to be mostly empty, and iterating over it is a waste of CPU. If the storage backend is fast enough this can have an effect on I/O performance. This patch keeps the index of the last used (i.e. non-zero) entry in the refcount table and updates it every time the table changes. The refcount-block overlap check then uses that index instead of reading the whole table. In my tests with a 4GB qcow2 file stored in RAM this doubles the amount of write IOPS. Signed-off-by: Alberto Garcia <berto@igalia.com> Message-id: 20170201123828.4815-1-berto@igalia.com Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2016-12-06qcow2: Don't strand clusters near 2G intervals during commitEric Blake1-1/+2
The qcow2_make_empty() function is reached during 'qemu-img commit', in order to clear out ALL clusters of an image. However, if the image cannot use the fast code path (true if the image is format 0.10, or if the image contains a snapshot), the cluster size is larger than 512, and the image is larger than 2G in size, then our choice of sector_step causes problems. Since it is not cluster aligned, but qcow2_discard_clusters() silently ignores an unaligned head or tail, we are leaving clusters allocated. Enhance the testsuite to expose the flaw, and patch the problem by ensuring our step size is aligned. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-11-25qcow2: Allow 'cache-clean-interval' in Linux onlyAlberto Garcia1-0/+8
The cache-clean-interval option of qcow2 only works on Linux. However we allow setting it in other systems regardless of whether it works or not. In those systems this option is not simply a no-op: it actually invalidates perfectly valid cache tables for no good reason without freeing their memory. This patch forbids using that option in non-Linux systems. Signed-off-by: Alberto Garcia <berto@igalia.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-11-22block: Return -ENOTSUP rather than assert on unaligned discardsEric Blake1-0/+5
Right now, the block layer rounds discard requests, so that individual drivers are able to assert that discard requests will never be unaligned. But there are some ISCSI devices that track and coalesce multiple unaligned requests, turning it into an actual discard if the requests eventually cover an entire page, which implies that it is better to always pass discard requests as low down the stack as possible. In isolation, this patch has no semantic effect, since the block layer currently never passes an unaligned request through. But the block layer already has code that silently ignores drivers that return -ENOTSUP for a discard request that cannot be honored (as well as drivers that return 0 even when nothing was done). But the next patch will update the block layer to fragment discard requests, so that clients are guaranteed that they are either dealing with an unaligned head or tail, or an aligned core, making it similar to the block layer semantics of write zero fragmentation. CC: qemu-stable@nongnu.org Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-11-22qcow2: Inform block layer about discard boundariesEric Blake1-0/+1
At the qcow2 layer, discard is only possible on a per-cluster basis; at the moment, qcow2 silently rounds any unaligned requests to this granularity. However, an upcoming patch will fix a regression in the block layer ignoring too much of an unaligned discard request, by changing the block layer to break up a discard request at alignment boundaries; for that to work, the block layer must know about our limits. However, we can't go one step further by changing qcow2_discard_clusters() to assert that requests are always aligned, since that helper function is reached on paths outside of the block layer. CC: qemu-stable@nongnu.org Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-10-24qcow2: Support BDRV_REQ_MAY_UNMAPFam Zheng1-1/+2
Handling this is similar to what is done to the L2 entry in the case of compressed clusters. Signed-off-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-10-19crypto: extend mode as a parameter in qcrypto_cipher_supports()Gonglei1-1/+2
It can't guarantee all cipher modes are supported if one cipher algorithm is supported by a backend. Let's extend qcrypto_cipher_supports() to take both the algorithm and mode as parameters. Signed-off-by: Gonglei <arei.gonglei@huawei.com> Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
2016-09-13qcow2: avoid memcpy(dst, NULL, len)Stefan Hajnoczi1-1/+4
Section "7.1.4 Use of library functions" in the C99 standard says: If an argument to a function has an invalid value (such as [...] a null pointer [...]) [...] the behavior is undefined. Additionally the "searching and sorting" functions are specified as requiring valid pointer values as described in 7.1.4. This patch fixes the following sanitizer errors: block/qcow2.c:1807:41: runtime error: null pointer passed as argument 2, which is declared to never be null block/qcow2-cluster.c:86:26: runtime error: null pointer passed as argument 2, which is declared to never be null Reported-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Message-id: 1473758138-19260-1-git-send-email-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-09-05qcow2: fix iovec size at qcow2_co_pwritev_compressedPavel Butsykin1-1/+1
Use bytes as the size would be more exact than s->cluster_size. Although qemu_iovec_to_buf() will not allow to go beyond the qiov. Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-09-05qcow2: cleanup qcow2_co_pwritev_compressed to avoid the recursionPavel Butsykin1-17/+7
Now that the function uses a vector instead of a buffer, there is no need to use recursive code. Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Jeff Cody <jcody@redhat.com> CC: Markus Armbruster <armbru@redhat.com> CC: Eric Blake <eblake@redhat.com> CC: John Snow <jsnow@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-09-05qcow2: add qcow2_co_pwritev_compressedPavel Butsykin1-74/+50
Added implementation of the qcow2_co_pwritev_compressed function that will allow us to safely use compressed writes for the qcow2 from running VMs. Signed-off-by: Pavel Butsykin <pbutsykin@virtuozzo.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Jeff Cody <jcody@redhat.com> CC: Markus Armbruster <armbru@redhat.com> CC: Eric Blake <eblake@redhat.com> CC: John Snow <jsnow@redhat.com> CC: Stefan Hajnoczi <stefanha@redhat.com> CC: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-07-26qcow2: do not allocate extra memoryVladimir Sementsov-Ogievskiy1-1/+1
There are no needs to allocate more than one cluster, as we set avail_out for deflate to one cluster. Zlib docs (http://www.zlib.net/manual.html) says: "deflate compresses as much data as possible, and stops when the input buffer becomes empty or the output buffer becomes full." So, deflate will not write more than avail_out to output buffer. If there is not enough space in output buffer for compressed data (it may be larger than input data) deflate just returns Z_OK. (if all data is compressed and written to output buffer deflate returns Z_STREAM_END). Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-id: 1468515565-81313-1-git-send-email-vsementsov@virtuozzo.com Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: John Snow <jsnow@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2016-07-20qcow2: Switch .bdrv_co_discard() to byte-basedEric Blake1-5/+5
Another step towards killing off sector-based block APIs. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 1468624988-423-15-git-send-email-eblake@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-07-13coroutine: move entry argument to qemu_coroutine_createPaolo Bonzini1-2/+2
In practice the entry argument is always known at creation time, and it is confusing that sometimes qemu_coroutine_enter is used with a non-NULL argument to re-enter a coroutine (this happens in block/sheepdog.c and tests/test-coroutine.c). So pass the opaque value at creation time, for consistency with e.g. aio_bh_new. Mostly done with the following semantic patch: @ entry1 @ expression entry, arg, co; @@ - co = qemu_coroutine_create(entry); + co = qemu_coroutine_create(entry, arg); ... - qemu_coroutine_enter(co, arg); + qemu_coroutine_enter(co); @ entry2 @ expression entry, arg; identifier co; @@ - Coroutine *co = qemu_coroutine_create(entry); + Coroutine *co = qemu_coroutine_create(entry, arg); ... - qemu_coroutine_enter(co, arg); + qemu_coroutine_enter(co); @ entry3 @ expression entry, arg; @@ - qemu_coroutine_enter(qemu_coroutine_create(entry), arg); + qemu_coroutine_enter(qemu_coroutine_create(entry, arg)); @ reentry @ expression co; @@ - qemu_coroutine_enter(co, NULL); + qemu_coroutine_enter(co); except for the aforementioned few places where the semantic patch stumbled (as expected) and for test_co_queue, which would otherwise produce an uninitialized variable warning. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-07-05block/qcow2: Don't use cpu_to_*w()Peter Maydell1-3/+3
Don't use the cpu_to_*w() functions, which we are trying to deprecate. Instead either just use cpu_to_*() to do the byteswap, or use st*_be_p() if we need to do the store somewhere other than to a variable that's already the correct type. Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Message-id: 1466093177-17890-1-git-send-email-peter.maydell@linaro.org Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2016-07-05block: Convert bdrv_co_preadv/pwritev to BdrvChildKevin Wolf1-3/+3
This is the final patch for converting the common I/O path to take a BdrvChild parameter instead of BlockDriverState. The completion of this conversion means that all users that perform I/O on an image need to actually hold a reference (in the form of BdrvChild, possible as part of a BlockBackend) to that image. This also protects against inconsistent use of BlockBackend vs. BlockDriverState functions because direct use of a BlockDriverState isn't possible any more and blk->root is private for block-backends.c. In addition, we can now distinguish different users in the I/O path, and the future op blockers work is going to add assertions based on permissions stored in BdrvChild. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-07-05block: Convert bdrv_pwrite_zeroes() to BdrvChildKevin Wolf1-2/+2
Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-07-05block: Convert bdrv_pwrite(v/_sync) to BdrvChildKevin Wolf1-7/+7
Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-07-05block: Convert bdrv_pread(v) to BdrvChildKevin Wolf1-8/+8
Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Max Reitz <mreitz@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-07-05block: Convert bdrv_write() to BdrvChildKevin Wolf1-1/+46
Signed-off-by: Kevin Wolf <kwolf@redhat.com> Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-07-05block: Use bool as appropriate for BDS membersEric Blake1-1/+1
Using int for values that are only used as booleans is confusing. While at it, rearrange a couple of members so that all the bools are contiguous. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-07-05block: Move request_alignment into BlockLimitEric Blake1-1/+1
It makes more sense to have ALL block size limit constraints in the same struct. Improve the documentation while at it. Simplify a couple of conditionals, now that we have audited and documented that request_alignment is always non-zero. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-07-05qcow2: Set request_alignment during .bdrv_refresh_limits()Eric Blake1-3/+4
We want to eventually stick request_alignment alongside other BlockLimits, but first, we must ensure it is populated at the same time as all other limits, rather than being a special case that is set only when a block is first opened. Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-06-20error: Remove NULL checks on error_propagate() callsEduardo Habkost1-3/+1
error_propagate() already ignores local_err==NULL, so there's no need to check it before calling. Coccinelle patch used to perform the changes added to scripts/coccinelle/error_propagate_null.cocci. Reviewed-by: Eric Blake <eblake@redhat.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Signed-off-by: Eduardo Habkost <ehabkost@redhat.com> Message-Id: <1465855078-19435-2-git-send-email-ehabkost@redhat.com> Signed-off-by: Markus Armbruster <armbru@redhat.com>
2016-06-16qcow2: Let vmstate call qcow2_co_preadv/pwrite directlyKevin Wolf1-20/+4
We don't really want to go through the block layer in order to read from or write to the vmstate in a qcow2 image. Doing so required a few ugly hacks like saving and restoring the old image size (because writing to vmstate offsets would increase the image size) or disabling the "reads after EOF = zeroes" logic. When calling the right functions directly, these hacks aren't necessary any more. Note that .bdrv_vmstate_load/save() return 0 instead of the number of bytes in case of success now. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-06-16block: Make .bdrv_load_vmstate() vectoredKevin Wolf1-3/+3
This brings it in line with .bdrv_save_vmstate(). Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Fam Zheng <famz@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
2016-06-16block: drop support for using qcow[2] encryption with system emulatorsDaniel P. Berrange1-4/+10
Back in the 2.3.0 release we declared qcow[2] encryption as deprecated, warning people that it would be removed in a future release. commit a1f688f4152e65260b94f37543521ceff8bfebe4 Author: Markus Armbruster <armbru@redhat.com> Date: Fri Mar 13 21:09:40 2015 +0100 block: Deprecate QCOW/QCOW2 encryption The code still exists today, but by a (happy?) accident we entirely broke the ability to use qcow[2] encryption in the system emulators in the 2.4.0 release due to commit 8336aafae1451d54c81dd2b187b45f7c45d2428e Author: Daniel P. Berrange <berrange@redhat.com> Date: Tue May 12 17:09:18 2015 +0100 qcow2/qcow: protect against uninitialized encryption key This commit was designed to prevent future coding bugs which might cause QEMU to read/write data on an encrypted block device in plain text mode before a decryption key is set. It turns out this preventative measure was a little too good, because we already had a long standing bug where QEMU read encrypted data in plain text mode during system emulator startup, in order to guess disk geometry: Thread 10 (Thread 0x7fffd3fff700 (LWP 30373)): #0 0x00007fffe90b1a28 in raise () at /lib64/libc.so.6 #1 0x00007fffe90b362a in abort () at /lib64/libc.so.6 #2 0x00007fffe90aa227 in __assert_fail_base () at /lib64/libc.so.6 #3 0x00007fffe90aa2d2 in () at /lib64/libc.so.6 #4 0x000055555587ae19 in qcow2_co_readv (bs=0x5555562accb0, sector_num=0, remaining_sectors=1, qiov=0x7fffffffd260) at block/qcow2.c:1229 #5 0x000055555589b60d in bdrv_aligned_preadv (bs=bs@entry=0x5555562accb0, req=req@entry=0x7fffd3ffea50, offset=offset@entry=0, bytes=bytes@entry=512, align=align@entry=512, qiov=qiov@entry=0x7fffffffd260, flags=0) at block/io.c:908 #6 0x000055555589b8bc in bdrv_co_do_preadv (bs=0x5555562accb0, offset=0, bytes=512, qiov=0x7fffffffd260, flags=<optimized out>) at block/io.c:999 #7 0x000055555589c375 in bdrv_rw_co_entry (opaque=0x7fffffffd210) at block/io.c:544 #8 0x000055555586933b in coroutine_thread (opaque=0x555557876310) at coroutine-gthread.c:134 #9 0x00007ffff64e1835 in g_thread_proxy (data=0x5555562b5590) at gthread.c:778 #10 0x00007ffff6bb760a in start_thread () at /lib64/libpthread.so.0 #11 0x00007fffe917f59d in clone () at /lib64/libc.so.6 Thread 1 (Thread 0x7ffff7ecab40 (LWP 30343)): #0 0x00007fffe91797a9 in syscall () at /lib64/libc.so.6 #1 0x00007ffff64ff87f in g_cond_wait (cond=cond@entry=0x555555e085f0 <coroutine_cond>, mutex=mutex@entry=0x555555e08600 <coroutine_lock>) at gthread-posix.c:1397 #2 0x00005555558692c3 in qemu_coroutine_switch (co=<optimized out>) at coroutine-gthread.c:117 #3 0x00005555558692c3 in qemu_coroutine_switch (from_=0x5555562b5e30, to_=to_@entry=0x555557876310, action=action@entry=COROUTINE_ENTER) at coroutine-gthread.c:175 #4 0x0000555555868a90 in qemu_coroutine_enter (co=0x555557876310, opaque=0x0) at qemu-coroutine.c:116 #5 0x0000555555859b84 in thread_pool_completion_bh (opaque=0x7fffd40010e0) at thread-pool.c:187 #6 0x0000555555859514 in aio_bh_poll (ctx=ctx@entry=0x5555562953b0) at async.c:85 #7 0x0000555555864d10 in aio_dispatch (ctx=ctx@entry=0x5555562953b0) at aio-posix.c:135 #8 0x0000555555864f75 in aio_poll (ctx=ctx@entry=0x5555562953b0, blocking=blocking@entry=true) at aio-posix.c:291 #9 0x000055555589c40d in bdrv_prwv_co (bs=bs@entry=0x5555562accb0, offset=offset@entry=0, qiov=qiov@entry=0x7fffffffd260, is_write=is_write@entry=false, flags=flags@entry=(unknown: 0)) at block/io.c:591 #10 0x000055555589c503 in bdrv_rw_co (bs=bs@entry=0x5555562accb0, sector_num=sector_num@entry=0, buf=buf@entry=0x7fffffffd2e0 "\321,", nb_sectors=nb_sectors@entry=21845, is_write=is_write@entry=false, flags=flags@entry=(unknown: 0)) at block/io.c:614 #11 0x000055555589c562 in bdrv_read_unthrottled (nb_sectors=21845, buf=0x7fffffffd2e0 "\321,", sector_num=0, bs=0x5555562accb0) at block/io.c:622 #12 0x000055555589c562 in bdrv_read_unthrottled (bs=0x5555562accb0, sector_num=sector_num@entry=0, buf=buf@entry=0x7fffffffd2e0 "\321,", nb_sectors=nb_sectors@entry=21845) at block/io.c:634 nb_sectors@entry=1) at block/block-backend.c:504 #14 0x0000555555752e9f in guess_disk_lchs (blk=blk@entry=0x5555562a5290, pcylinders=pcylinders@entry=0x7fffffffd52c, pheads=pheads@entry=0x7fffffffd530, psectors=psectors@entry=0x7fffffffd534) at hw/block/hd-geometry.c:68 #15 0x0000555555752ff7 in hd_geometry_guess (blk=0x5555562a5290, pcyls=pcyls@entry=0x555557875d1c, pheads=pheads@entry=0x555557875d20, psecs=psecs@entry=0x555557875d24, ptrans=ptrans@entry=0x555557875d28) at hw/block/hd-geometry.c:133 #16 0x0000555555752b87 in blkconf_geometry (conf=conf@entry=0x555557875d00, ptrans=ptrans@entry=0x555557875d28, cyls_max=cyls_max@entry=65536, heads_max=heads_max@entry=16, secs_max=secs_max@entry=255, errp=errp@entry=0x7fffffffd5e0) at hw/block/block.c:71 #17 0x0000555555799bc4 in ide_dev_initfn (dev=0x555557875c80, kind=IDE_HD) at hw/ide/qdev.c:174 #18 0x0000555555768394 in device_realize (dev=0x555557875c80, errp=0x7fffffffd640) at hw/core/qdev.c:247 #19 0x0000555555769a81 in device_set_realized (obj=0x555557875c80, value=<optimized out>, errp=0x7fffffffd730) at hw/core/qdev.c:1058 #20 0x00005555558240ce in property_set_bool (obj=0x555557875c80, v=<optimized out>, opaque=0x555557875de0, name=<optimized out>, errp=0x7fffffffd730) at qom/object.c:1514 #21 0x0000555555826c87 in object_property_set_qobject (obj=obj@entry=0x555557875c80, value=value@entry=0x55555784bcb0, name=name@entry=0x55555591cb3d "realized", errp=errp@entry=0x7fffffffd730) at qom/qom-qobject.c:24 #22 0x0000555555825760 in object_property_set_bool (obj=obj@entry=0x555557875c80, value=value@entry=true, name=name@entry=0x55555591cb3d "realized", errp=errp@entry=0x7fffffffd730) at qom/object.c:905 #23 0x000055555576897b in qdev_init_nofail (dev=dev@entry=0x555557875c80) at hw/core/qdev.c:380 #24 0x0000555555799ead in ide_create_drive (bus=bus@entry=0x555557629630, unit=unit@entry=0, drive=0x5555562b77e0) at hw/ide/qdev.c:122 #25 0x000055555579a746 in pci_ide_create_devs (dev=dev@entry=0x555557628db0, hd_table=hd_table@entry=0x7fffffffd830) at hw/ide/pci.c:440 #26 0x000055555579b165 in pci_piix3_ide_init (bus=<optimized out>, hd_table=0x7fffffffd830, devfn=<optimized out>) at hw/ide/piix.c:218 #27 0x000055555568ca55 in pc_init1 (machine=0x5555562960a0, pci_enabled=1, kvmclock_enabled=<optimized out>) at /home/berrange/src/virt/qemu/hw/i386/pc_piix.c:256 #28 0x0000555555603ab2 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4249 So the safety net is correctly preventing QEMU reading cipher text as if it were plain text, during startup and aborting QEMU to avoid bad usage of this data. For added fun this bug only happens if the encrypted qcow2 file happens to have data written to the first cluster, otherwise the cluster won't be allocated and so qcow2 would not try the decryption routines at all, just return all 0's. That no one even noticed, let alone reported, this bug that has shipped in 2.4.0, 2.5.0 and 2.6.0 shows that the number of actual users of encrypted qcow2 is approximately zero. So rather than fix the crash, and backport it to stable releases, just go ahead with what we have warned users about and disable any use of qcow2 encryption in the system emulators. qemu-img/qemu-io/qemu-nbd are still able to access qcow2 encrypted images for the sake of data conversion. In the future, qcow2 will gain support for the alternative luks format, but when this happens it'll be using the '-object secret' infrastructure for getting keys, which avoids this problematic scenario entirely. Signed-off-by: Daniel P. Berrange <berrange@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-06-16qcow2: Implement .bdrv_co_pwritev()Kevin Wolf1-48/+41
This changes qcow2 to implement the byte-based .bdrv_co_pwritev interface rather than the sector-based old one. As preallocation uses the same allocation function as normal writes, and the interface of that function needs to be changed, it is converted in the same patch. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
2016-06-16qcow2: Implement .bdrv_co_preadv()Kevin Wolf1-49/+59
Reading from qcow2 images is now byte granularity. Most of the affected code in qcow2 actually gets simpler with this change. The only exception is encryption, which is fixed on 512 bytes blocks; in order to keep this working, bs->request_alignment is set for encrypted images. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
2016-06-08qcow2: avoid extra flushes in qcow2Denis V. Lunev1-2/+2
The problem with excessive flushing was found by a couple of performance tests: - parallel directory tree creation (from 2 processes) - 32 cached writes + fsync at the end in a loop For the first one results improved from 2.6 loops/sec to 3.5 loops/sec. Each loop creates 10^3 directories with 10 files in each. For the second one results improved from ~600 fsync/sec to ~1100 fsync/sec. Though, it was run on SSD so it probably won't show such performance gain on rotational media. qcow2_cache_flush() calls bdrv_flush() unconditionally after writing cache entries of a particular cache. This can lead to as many as 2 additional fdatasyncs inside bdrv_flush. We can simply skip all fdatasync calls inside qcow2_co_flush_to_os as bdrv_flush for sure will do the job. These flushes are necessary to keep the right order of writes to the different caches. Though this is not necessary in the current code base as this ordering is ensured through the flush in qcow2_cache_flush_dependency(). Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Pavel Borzenkov <pborzenkov@virtuozzo.com> CC: Kevin Wolf <kwolf@redhat.com> CC: Max Reitz <mreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-06-08qcow2: Convert to bdrv_co_pwrite_zeroes()Eric Blake1-18/+19
Another step on our continuing quest to switch to byte-based interfaces. Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-06-08block: Switch bdrv_write_zeroes() to byte interfaceEric Blake1-5/+4
Rename to bdrv_pwrite_zeroes() to let the compiler ensure we cater to the updated semantics. Do the same for bdrv_co_write_zeroes(). Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-06-08block: Track write zero limits in bytesEric Blake1-1/+1
Another step towards removing sector-based interfaces: convert the maximum write and minimum alignment values from sectors to bytes. Rename the variables to let the compiler check that all users are converted to the new semantics. The maximum remains an int as long as BDRV_REQUEST_MAX_SECTORS is constrained by INT_MAX (this means that we can't even support a 2G write_zeroes, but just under it) - changing operation lengths to unsigned or to 64-bits is a much bigger audit, and debatable if we even want to do it (since at the core, a 32-bit platform will still have ssize_t as its underlying limit on write()). Meanwhile, alignment is changed to 'uint32_t', since it makes no sense to have an alignment larger than the maximum write, and less painful to use an unsigned type with well-defined behavior in bit operations than to have to worry about what happens if a driver mistakenly supplies a negative alignment. Add an assert that no one was trying to use sectors to get a write zeroes larger than 2G, and therefore that a later conversion to bytes won't be impacted by keeping the limit at 32 bits. Signed-off-by: Eric Blake <eblake@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2016-06-08qcow2: Catch more unaligned write_zero into zero clusterEric Blake1-24/+23
is_zero_cluster() and is_zero_cluster_top_locked() are used only by qcow2_co_write_zeroes(). The former is too broad (we don't care if the sectors we are about to overwrite are non-zero, only that all other sectors in the cluster are zero), so it needs to be called up to twice but with smaller limits - rename it along with adding the neeeded parameter. The latter can be inlined for more compact code. The testsuite change shows that we now have a sparser top file when an unaligned write_zeroes overwrites the only portion of the backing file with data. Based on a patch proposal by Denis V. Lunev. CC: Denis V. Lunev <den@openvz.org> Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>