From 974854ab0728532600c72e41a44d6ce1cf8f20a4 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Tue, 12 Jul 2022 18:37:54 -0700 Subject: cxl/acpi: Track CXL resources in iomem_resource Recall that CXL capable address ranges, on ACPI platforms, are published in the CEDT.CFMWS (CXL Early Discovery Table: CXL Fixed Memory Window Structures). These windows represent both the actively mapped capacity and the potential address space that can be dynamically assigned to a new CXL decode configuration (region / interleave-set). CXL endpoints like DDR DIMMs can be mapped at any physical address including 0 and legacy ranges. There is an expectation and requirement that the /proc/iomem interface and the iomem_resource tree in the kernel reflect the full set of platform address ranges. I.e. that every address range that platform firmware and bus drivers enumerate be reflected as an iomem_resource entry. The hard requirement to do this for CXL arises from the fact that facilities like CONFIG_DEVICE_PRIVATE expect to be able to treat empty iomem_resource ranges as free for software to use as proxy address space. Without CXL publishing its potential address ranges in iomem_resource, the CONFIG_DEVICE_PRIVATE mechanism may inadvertently steal capacity reserved for runtime provisioning of new CXL regions. So, iomem_resource needs to know about both active and potential CXL resource ranges. The active CXL resources might already be reflected in iomem_resource as "System RAM". insert_resource_expand_to_fit() handles re-parenting "System RAM" underneath a CXL window. The "_expand_to_fit()" behavior handles cases where a CXL window is not a strict superset of an existing entry in the iomem_resource tree. The "_expand_to_fit()" behavior is acceptable from the perspective of resource allocation. The expansion happens because a conflicting resource range is already populated, which means the resource boundary expansion does not result in any additional free CXL address space being made available. CXL address space allocation is always bounded by the orginal unexpanded address range. However, the potential for expansion does mean that something like walk_iomem_res_desc(IORES_DESC_CXL...) can only return fuzzy answers on corner case platforms that cause the resource tree to expand a CXL window resource over a range that is not decoded by CXL. This would be an odd platform configuration, but if it becomes a problem in practice the CXL subsytem could just publish an API that returns definitive answers. Cc: Andrew Morton Cc: David Hildenbrand Cc: Jason Gunthorpe Cc: Tony Luck Cc: Christoph Hellwig Reviewed-by: Jonathan Cameron Acked-by: Greg Kroah-Hartman Link: https://lore.kernel.org/r/165784325943.1758207.5310344844375305118.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams --- kernel/resource.c | 7 +++++++ 1 file changed, 7 insertions(+) (limited to 'kernel/resource.c') diff --git a/kernel/resource.c b/kernel/resource.c index 34eaee179689..53a534db350e 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -891,6 +891,13 @@ void insert_resource_expand_to_fit(struct resource *root, struct resource *new) } write_unlock(&resource_lock); } +/* + * Not for general consumption, only early boot memory map parsing, PCI + * resource discovery, and late discovery of CXL resources are expected + * to use this interface. The former are built-in and only the latter, + * CXL, is a module. + */ +EXPORT_SYMBOL_NS_GPL(insert_resource_expand_to_fit, CXL); /** * remove_resource - Remove a resource in the resource tree -- cgit v1.2.3 From 14b80582c43e4f550acfd93c2b2cadbe36ea0874 Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Fri, 20 May 2022 13:41:24 -0700 Subject: resource: Introduce alloc_free_mem_region() The core of devm_request_free_mem_region() is a helper that searches for free space in iomem_resource and performs __request_region_locked() on the result of that search. The policy choices of the implementation conform to what CONFIG_DEVICE_PRIVATE users want which is memory that is immediately marked busy, and a preference to search for the first-fit free range in descending order from the top of the physical address space. CXL has a need for a similar allocator, but with the following tweaks: 1/ Search for free space in ascending order 2/ Search for free space relative to a given CXL window 3/ 'insert' rather than 'request' the new resource given downstream drivers from the CXL Region driver (like the pmem or dax drivers) are responsible for request_mem_region() when they activate the memory range. Rework __request_free_mem_region() into get_free_mem_region() which takes a set of GFR_* (Get Free Region) flags to control the allocation policy (ascending vs descending), and "busy" policy (insert_resource() vs request_region()). As part of the consolidation of the legacy GFR_REQUEST_REGION case with the new default of just inserting a new resource into the free space some minor cleanups like not checking for NULL before calling devres_free() (which does its own check) is included. Suggested-by: Jason Gunthorpe Link: https://lore.kernel.org/linux-cxl/20220420143406.GY2120790@nvidia.com/ Cc: Matthew Wilcox Cc: Christoph Hellwig Reviewed-by: Jonathan Cameron Link: https://lore.kernel.org/r/165784333333.1758207.13703329337805274043.stgit@dwillia2-xfh.jf.intel.com Signed-off-by: Dan Williams --- include/linux/ioport.h | 2 + kernel/resource.c | 178 +++++++++++++++++++++++++++++++++++++++---------- mm/Kconfig | 5 ++ 3 files changed, 150 insertions(+), 35 deletions(-) (limited to 'kernel/resource.c') diff --git a/include/linux/ioport.h b/include/linux/ioport.h index 79d1ad6d6275..616b683563a9 100644 --- a/include/linux/ioport.h +++ b/include/linux/ioport.h @@ -330,6 +330,8 @@ struct resource *devm_request_free_mem_region(struct device *dev, struct resource *base, unsigned long size); struct resource *request_free_mem_region(struct resource *base, unsigned long size, const char *name); +struct resource *alloc_free_mem_region(struct resource *base, + unsigned long size, unsigned long align, const char *name); static inline void irqresource_disabled(struct resource *res, u32 irq) { diff --git a/kernel/resource.c b/kernel/resource.c index 53a534db350e..4c5e80b92f2f 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -489,8 +489,9 @@ int __weak page_is_ram(unsigned long pfn) } EXPORT_SYMBOL_GPL(page_is_ram); -static int __region_intersects(resource_size_t start, size_t size, - unsigned long flags, unsigned long desc) +static int __region_intersects(struct resource *parent, resource_size_t start, + size_t size, unsigned long flags, + unsigned long desc) { struct resource res; int type = 0; int other = 0; @@ -499,7 +500,7 @@ static int __region_intersects(resource_size_t start, size_t size, res.start = start; res.end = start + size - 1; - for (p = iomem_resource.child; p ; p = p->sibling) { + for (p = parent->child; p ; p = p->sibling) { bool is_type = (((p->flags & flags) == flags) && ((desc == IORES_DESC_NONE) || (desc == p->desc))); @@ -543,7 +544,7 @@ int region_intersects(resource_size_t start, size_t size, unsigned long flags, int ret; read_lock(&resource_lock); - ret = __region_intersects(start, size, flags, desc); + ret = __region_intersects(&iomem_resource, start, size, flags, desc); read_unlock(&resource_lock); return ret; @@ -1780,62 +1781,139 @@ void resource_list_free(struct list_head *head) } EXPORT_SYMBOL(resource_list_free); -#ifdef CONFIG_DEVICE_PRIVATE -static struct resource *__request_free_mem_region(struct device *dev, - struct resource *base, unsigned long size, const char *name) +#ifdef CONFIG_GET_FREE_REGION +#define GFR_DESCENDING (1UL << 0) +#define GFR_REQUEST_REGION (1UL << 1) +#define GFR_DEFAULT_ALIGN (1UL << PA_SECTION_SHIFT) + +static resource_size_t gfr_start(struct resource *base, resource_size_t size, + resource_size_t align, unsigned long flags) +{ + if (flags & GFR_DESCENDING) { + resource_size_t end; + + end = min_t(resource_size_t, base->end, + (1ULL << MAX_PHYSMEM_BITS) - 1); + return end - size + 1; + } + + return ALIGN(base->start, align); +} + +static bool gfr_continue(struct resource *base, resource_size_t addr, + resource_size_t size, unsigned long flags) +{ + if (flags & GFR_DESCENDING) + return addr > size && addr >= base->start; + /* + * In the ascend case be careful that the last increment by + * @size did not wrap 0. + */ + return addr > addr - size && + addr <= min_t(resource_size_t, base->end, + (1ULL << MAX_PHYSMEM_BITS) - 1); +} + +static resource_size_t gfr_next(resource_size_t addr, resource_size_t size, + unsigned long flags) +{ + if (flags & GFR_DESCENDING) + return addr - size; + return addr + size; +} + +static void remove_free_mem_region(void *_res) +{ + struct resource *res = _res; + + if (res->parent) + remove_resource(res); + free_resource(res); +} + +static struct resource * +get_free_mem_region(struct device *dev, struct resource *base, + resource_size_t size, const unsigned long align, + const char *name, const unsigned long desc, + const unsigned long flags) { - resource_size_t end, addr; + resource_size_t addr; struct resource *res; struct region_devres *dr = NULL; - size = ALIGN(size, 1UL << PA_SECTION_SHIFT); - end = min_t(unsigned long, base->end, (1UL << MAX_PHYSMEM_BITS) - 1); - addr = end - size + 1UL; + size = ALIGN(size, align); res = alloc_resource(GFP_KERNEL); if (!res) return ERR_PTR(-ENOMEM); - if (dev) { + if (dev && (flags & GFR_REQUEST_REGION)) { dr = devres_alloc(devm_region_release, sizeof(struct region_devres), GFP_KERNEL); if (!dr) { free_resource(res); return ERR_PTR(-ENOMEM); } + } else if (dev) { + if (devm_add_action_or_reset(dev, remove_free_mem_region, res)) + return ERR_PTR(-ENOMEM); } write_lock(&resource_lock); - for (; addr > size && addr >= base->start; addr -= size) { - if (__region_intersects(addr, size, 0, IORES_DESC_NONE) != - REGION_DISJOINT) + for (addr = gfr_start(base, size, align, flags); + gfr_continue(base, addr, size, flags); + addr = gfr_next(addr, size, flags)) { + if (__region_intersects(base, addr, size, 0, IORES_DESC_NONE) != + REGION_DISJOINT) continue; - if (__request_region_locked(res, &iomem_resource, addr, size, - name, 0)) - break; + if (flags & GFR_REQUEST_REGION) { + if (__request_region_locked(res, &iomem_resource, addr, + size, name, 0)) + break; - if (dev) { - dr->parent = &iomem_resource; - dr->start = addr; - dr->n = size; - devres_add(dev, dr); - } + if (dev) { + dr->parent = &iomem_resource; + dr->start = addr; + dr->n = size; + devres_add(dev, dr); + } - res->desc = IORES_DESC_DEVICE_PRIVATE_MEMORY; - write_unlock(&resource_lock); + res->desc = desc; + write_unlock(&resource_lock); + + + /* + * A driver is claiming this region so revoke any + * mappings. + */ + revoke_iomem(res); + } else { + res->start = addr; + res->end = addr + size - 1; + res->name = name; + res->desc = desc; + res->flags = IORESOURCE_MEM; + + /* + * Only succeed if the resource hosts an exclusive + * range after the insert + */ + if (__insert_resource(base, res) || res->child) + break; + + write_unlock(&resource_lock); + } - /* - * A driver is claiming this region so revoke any mappings. - */ - revoke_iomem(res); return res; } write_unlock(&resource_lock); - free_resource(res); - if (dr) + if (flags & GFR_REQUEST_REGION) { + free_resource(res); devres_free(dr); + } else if (dev) + devm_release_action(dev, remove_free_mem_region, res); return ERR_PTR(-ERANGE); } @@ -1854,18 +1932,48 @@ static struct resource *__request_free_mem_region(struct device *dev, struct resource *devm_request_free_mem_region(struct device *dev, struct resource *base, unsigned long size) { - return __request_free_mem_region(dev, base, size, dev_name(dev)); + unsigned long flags = GFR_DESCENDING | GFR_REQUEST_REGION; + + return get_free_mem_region(dev, base, size, GFR_DEFAULT_ALIGN, + dev_name(dev), + IORES_DESC_DEVICE_PRIVATE_MEMORY, flags); } EXPORT_SYMBOL_GPL(devm_request_free_mem_region); struct resource *request_free_mem_region(struct resource *base, unsigned long size, const char *name) { - return __request_free_mem_region(NULL, base, size, name); + unsigned long flags = GFR_DESCENDING | GFR_REQUEST_REGION; + + return get_free_mem_region(NULL, base, size, GFR_DEFAULT_ALIGN, name, + IORES_DESC_DEVICE_PRIVATE_MEMORY, flags); } EXPORT_SYMBOL_GPL(request_free_mem_region); -#endif /* CONFIG_DEVICE_PRIVATE */ +/** + * alloc_free_mem_region - find a free region relative to @base + * @base: resource that will parent the new resource + * @size: size in bytes of memory to allocate from @base + * @align: alignment requirements for the allocation + * @name: resource name + * + * Buses like CXL, that can dynamically instantiate new memory regions, + * need a method to allocate physical address space for those regions. + * Allocate and insert a new resource to cover a free, unclaimed by a + * descendant of @base, range in the span of @base. + */ +struct resource *alloc_free_mem_region(struct resource *base, + unsigned long size, unsigned long align, + const char *name) +{ + /* Default of ascending direction and insert resource */ + unsigned long flags = 0; + + return get_free_mem_region(NULL, base, size, align, name, + IORES_DESC_NONE, flags); +} +EXPORT_SYMBOL_NS_GPL(alloc_free_mem_region, CXL); +#endif /* CONFIG_GET_FREE_REGION */ static int __init strict_iomem(char *str) { diff --git a/mm/Kconfig b/mm/Kconfig index 169e64192e48..a5b4fee2e3fd 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -994,9 +994,14 @@ config HMM_MIRROR bool depends on MMU +config GET_FREE_REGION + depends on SPARSEMEM + bool + config DEVICE_PRIVATE bool "Unaddressable device memory (GPU memory, ...)" depends on ZONE_DEVICE + select GET_FREE_REGION help Allows creation of struct pages to represent unaddressable device -- cgit v1.2.3