返回 Discover
Field DispatchStack Overflow7 · 2026-06-01

Grouping text rows in equal-sized groups

Tags
sqlpostgresql
Score
4
Answers
1
Views
101
Answered
No
痛点分析发布于 2026/05/31

痛点为 AI 基于上游原始证据的初步提炼;未包含额外中国市场检索。

痛点

用户需要将大量文本行按字母顺序分组为大小相近的组,以便UI展示为两级层次结构,减少滚动。现有流程中,用户尝试用SQL和自定义评分函数实现分组,但遇到了分组边界重复、产生多余小分组的问题,导致分组结果不符合预期(如出现仅含1-2个元素的组)。这迫使用户手动调整或接受不完美的分组,增加了开发调试时间,且分组逻辑复杂(涉及三角函数、窗口函数等),难以维护和保证一致性。用户明确提到需要物化视图来提升性能,但当前查询无法正确去重边界,造成重复劳动和决策困难。

§ Dossier

Stack Overflow question

I have a table containing a lot of typed/named items and I want to create a (materialized?) view to group them together so that a UI can later make it easier for users to select. Description The goal is: If there are N elements of the same type, make sqrt(N) groups each of size sqrt(N). Instead of having e.g. 10k elements to scroll through, the UI would show 100 lines (the groups), each of which can be expanded into 100 lines (the items inside each group). 1 more click to make but a lot less scrolling. The UI will display group names as the range containing the named elements inside it. For instance, if a group contains all the items whose names start with A and B (but not C and beyond), then the group should be named A-B . If a group contains all the elements starting with C until e.g. EG, but elements starting with EH are the the next group, the group should be name C-EG . The idea is: Instead of comparing the texts and deciding where the lists should be split into groups, I am using a function to score the items. Why I do that will make sense in a minute. Anyway, a heuristics turns that whole mess into a relatively usual problem to solve, that is searching for the local minimums of a function. Here is the twist (that explains why I am building a function): I could make it so that each group has exactly the same size (ignoring the rounding of sqrt(N) to the nearest integer) but instead, I want to find a compromise between group size and length of the group name. As groups are named with the pattern - , I ideally want and to be of length 3 at most. Since I am fine with counts being "reasonably" different from sqrt(N), I am also fine with them being slightly out of date. My idea therefore is to feed the result of this query into a materialized view that is refreshed e.g. every night. The function works that way: For each item, count how many first letters it has in common with the item immediately before it. Take a SIN squared function, with parameters so that it oscillate sqrt(N) times in the range [1-N]. Insert parametrization into the 2 steps above (add constants, multiply by constant, raise to some power ... whatever seems to give good results once I figure the problem out). Multiply the 2 together to get a "score" Look for a minimum score, it will act as the end of the group. If that helps, here is what it looks like for 100 items (names generated randomly with the below queries), with SIN in blue and the score in black. If I want, e.g. shorter group names at the cost of consistent group sizes, I can tweak the score so that the SIN is not just squared anymore but instead raised to a higher power: that will flatten the troughs of the sin-wave / sharpen the crests. By doing so, more items will be place inside the troughs, therefore I will get more chances that a shorter but off boundary will be preferred over a longer boundary that lands on sqrt(N). Here is a zoomed in version of the 2 charts: Here, the heuristics determins the list should be split around rtetx . The group right before this split will be named [...] - RC and the group after this split will be named RT - [...] . Raising the SIN function to a higher power (note: the software I used to draw the chart interpolates the points into a smooth curve. The curve is only correct on the points, not between them) increases the tolerance, like I said. Now, the group can equally be split around rcEYu or slkkE . Now, the group names will be either: [...] - Q and R - [...] Or [...] - R and S - [...] Now, 1 letter is required for the common boundary of both groups whereas 2 letters were required before. Queries I tried the following query Building the data sample with random data CREATE TABLE Test ( ItemType INTEGER, ItemName VARCHAR ); WITH Chars(chars) AS ( SELECT ARRAY[ 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' ] ) INSERT INTO Test SELECT t, string_agg(chars[random(0, 165)], '') FROM chars CROSS JOIN generate_series(1, 3) t CROSS JOIN generate_series(1, 5) l CROSS JOIN generate_series(1, 400) i GROUP BY t, i; Finding the text to use in group names (query takes 4 AND (stats3.totalnamecount - stats3.NameCount) > 4 AND stats3.totalnamecount > 50 ) SELECT s1.ItemType, s1.NameStarts, s1.NameCount FROM stats4 s1 JOIN ( SELECT stats4.ItemType, min(stats4.score) AS bestscore, stats4.groupnumber FROM stats4 GROUP BY stats4.ItemType, stats4.groupnumber ) s2 ON s1.ItemType = s2.ItemType AND s1.groupnumber = s2.groupnumber WHERE s1.score = s2.bestscore Problem In the above query, I have added the NameCount column to track how many items belong to 1 group. As the sample data consists in 3 groups of 400 items each, I expect 3 types, each of 20 groups, each group having around 20 items (thus this 3rd columns should show 0, 20, 40, 60, ...). However, while I get the correct rows, I also get additional groups with 1-2 items in it. For instance, I will see: For the first group: NameCount = 20 For the second group: NameCount = 39 For the third group: NameCount = 41 When this happens, that means the function calculates 2 group boundaries with equal value; one of them needs to go. I can't seem to find an easy way to remove those extra records. Has anyone got an idea? Edit: To clarify what I am after (some of which will be a repetition, bear with me): An application displays the result of the query in its UI. It can (and probably will) process the output of the query. Ultimately, my goal is to decrease the amount of scrolling required to view any item in the list by grouping them into a 2-level hierarchy. for N items, groups should have sqrt(N) groups, each with around sqrt(N) items. The number of item in each group depends on the heuristic being strict (aims at sqrt(N)) or tolerant (aims at a compromise between sqrt(N) and the length of the group name). sqrt(N) is not guaranteed to be an integer. For 10000, I expect exactly 100 groups of 100 items. For 10001, I can accept having 100 or 101 groups. If you choose to create 101 groups, I do not mind items being spread onto all those groups. The approach in my query (at least in 1 version before I try to fix its issues) was to ignore the issue completely: 100 items per group in the first 100 group, 1 item in the final group (it is only fair the final group is smaller as it appears last and requires more scrolling to reach than any of the other groups). There are 3 reasons why I currently save data in a materialized view: It is faster for the application. New data is not inserted often enough in the table for it to matter (think about 10 records inserted in a 10k table, that will often not change the group). Users visiting the same list several times in a day will be guaranteed to get the same grouping. It is only after they go to sleep that the grouping will be different in the UI. Because of the delay between data insertion and grouping update, the query does not need to (and probably should not) return both the lower and upper bound of each group. For example, if 2 consecutive groups are A-B and E-F , the application will show contiguous segments, such as A-D and E-F . If it did not, then items starting with C would not be findable.

§ Dossier

Question details

View count
101
Answer count
1
Last activity
2026/05/31
§ Dossier

Answers

My first draft has minor performance improvements: reducing joins by using window functions materialise the TRIM(UPPER()) and then index on it The two logical issues it addresses are Avoiding the break-points being close to each other Avoiding selecting two minima in a single "group" The first issue is addressed by offsetting everything by "half a page", and not breaking in the first or last "pages". string old_group old break point new group new break point aaa 0 0 bbb 0 0 ccc 0 0 ddd 0 1 eee 0 yes 1 fff 1 yes 1 yes ggg 1 1 hhh 1 1 iii 1 2 jjj 1 2 (These a total fictional values to illustrate the change.) Old method breaks two pages once each, and yielding three pages, one being tiny: aaa -> eee fff -> fff ggg -> jjj New method breaks one fewer page, and biases the break point the the centre: aaa -> fff ggg -> jjj The second issue is addressed using FIRST_VALUE() instead of MAX() . CREATE TABLE Items ( ItemType INTEGER, ItemName VARCHAR(512), ItemNameUpper VARCHAR(512) GENERATED ALWAYS AS (TRIM(BOTH FROM upper(ItemName))) STORED ); CREATE INDEX ix_test_type_nameUpper ON Items(ItemType, ItemNameUpper); SET LOCAL SEED = 0.5; WITH Chars(chars) AS ( SELECT ARRAY[ 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' ] ) INSERT INTO Items SELECT t, string_agg(chars[random(0, 165)], '') FROM chars CROSS JOIN generate_series(1, 3) t CROSS JOIN generate_series(1, 5) l CROSS JOIN generate_series(1, 400) i GROUP BY t, i; WITH stats1(ItemType, NameStarts, len, NameCount) AS ( SELECT Items.ItemType, nameStarts, i, SUM(COUNT(*)) OVER (PARTITION BY ItemType, i ORDER BY NameStarts) FROM Items CROSS JOIN generate_series(1, 3) AS i CROSS JOIN LATERAL ( SELECT LEFT(ItemNameUpper, i) ) AS name_starts(nameStarts) GROUP BY Items.ItemType, nameStarts, i ), stats3(ItemType, NameStarts, NameCount, totalnamecount) AS ( SELECT ItemType, MIN(NameStarts), NameCount, MAX(NameCount) OVER (PARTITION BY ItemType) FROM stats1 GROUP BY ItemType, NameCount ), stats4(ItemType, NameStarts, NameCount, naive_group_id, score, new_group_marker, min_score) AS ( SELECT stats3.ItemType, stats3.NameStarts, stats3.NameCount, page.id, page.score, CASE WHEN page.id IN (0, totalnamecount / group_size) THEN 0 WHEN NameCount = FIRST_VALUE(NameCount) OVER (PARTITION BY ItemType, page.id ORDER BY page.score) THEN 1 ELSE 0 END, MIN(page.score) OVER (PARTITION BY ItemType, page.id) FROM stats3 CROSS JOIN LATERAL ( SELECT CASE WHEN SQRT(stats3.totalnamecount) > 1, (LENGTH(stats3.NameStarts) + 3) * (1 + POWER(sin(pi() * stats3.NameCount / group_size), 2)) ) AS page(id, score) ), debug_data AS ( SELECT itemType, nameStarts, nameCount, naive_Group_id, score, min_score, SUM(new_group_marker) OVER (PARTITION BY ItemType ORDER BY namecount) AS group_id FROM Stats4 ORDER BY 1,3,2,4 ) --SELECT * FROM debug_data WHERE ItemType = 2 ORDER BY ItemType, namecount SELECT ItemType, group_id, MIN(namestarts), MAX(namestarts), COUNT(*) AS group_size FROM debug_data GROUP BY itemType, group_id ORDER BY 1, 2 ; itemtype group_id min max group_size 1 0 0 7K 19 1 1 8 AYC 18 1 2 A CX 21 1 3 C DYM 17 1 4 D FGP 23 1 5 F HEH 20 1 6 HF HZE 17 1 7 H JEJ 23 1 8 J KY 20 1 9 K LYP 21 1 10 L MW 17 1 11 M OK 22 1 12 O PJ 20 1 13 P QSC 17 1 14 Q SDO 22 1 15 S TGJ 21 1 16 T UY 19 1 17 U WI 20 1 18 W XT 17 1 19 X ZZF 23 2 0 0 ADL 20 2 1 A BA4 19 2 2 BA BV 20 2 3 B DHM 19 2 4 D EJF 20 2 5 E FT 16 2 6 F HM 21 2 7 H IZH 22 2 8 I JY 16 2 9 J LOA 22 2 10 L MP 20 2 11 M OMB 20 2 12 O PFO 20 2 13 P QYF 21 2 14 Q SO 22 2 15 S UGS 17 2 16 U VR 19 2 17 V XH 21 2 18 X YJB 20 2 19 Y ZZL 21 3 0 1 8V 21 3 1 8 BD8 17 3 2 B CV 21 3 3 C DY 17 3 4 D FP 22 3 5 F GX 17 3 6 G IT 25 3 7 I JT 15 3 8 J LW 24 3 9 L NJ 19 3 10 N OW 18 3 11 O QKC 23 3 12 Q ROB 19 3 13 R SR 19 3 14 S TV 21 3 15 T UXI 18 3 16 U VX 18 3 17 V XV 24 3 18 X YW 17 3 19 Y ZR 21 fiddle Notes: I've changed the formatting to suit myself, that's optional. I've also started replacing stat1 , stat2 , stat3 with useful naming, that should never be optional, it made it very hard to initially decipher your code. I change the scoring function to exaggerate the variable group sizes, that should be easy to change to anything else. I'm certain that this can be optimised significantly further, but I stopped when you commented that you don't care about that.

评论作者信息不可用0 votes
源数据· Raw Archive
source
Stack Overflow
upstream_source
stackoverflow
upstream_item_id
79949208
daily_ranking_item_id
29f93f68-882a-44a8-b55e-4d99cd33608b
rank_date
2026-06-01
rank
7
name
Grouping text rows in equal-sized groups
tagline
sql, postgresql
description
I have a table containing a lot of typed/named items and I want to create a (materialized?) view to group them together so that a UI can later make it easier for users to select. Description The goal is: If there are N elements of the same type, make sqrt(N) groups each of size sqrt(N). Instead of having e.g. 10k elements to scroll through, the UI would show 100 lines (the groups), each of which can be expanded into 100 lines (the items inside each group). 1 more click to make but a lot less scrolling. The UI will display group names as the range containing the named elements inside it. For instance, if a group contains all the items whose names start with A and B (but not C and beyond), then the group should be named A-B . If a group contains all the elements starting with C until e.g. EG, but elements starting with EH are the the next group, the group should be name C-EG . The idea is: Instead of comparing the texts and deciding where the lists should be split into groups, I am using a function to score the items. Why I do that will make sense in a minute. Anyway, a heuristics turns that whole mess into a relatively usual problem to solve, that is searching for the local minimums of a function. Here is the twist (that explains why I am building a function): I could make it so that each group has exactly the same size (ignoring the rounding of sqrt(N) to the nearest integer) but instead, I want to find a compromise between group size and length of the group name. As groups are named with the pattern <start of group> - <end of group> , I ideally want <start of group> and <end of group> to be of length 3 at most. Since I am fine with counts being "reasonably" different from sqrt(N), I am also fine with them being slightly out of date. My idea therefore is to feed the result of this query into a materialized view that is refreshed e.g. every night. The function works that way: For each item, count how many first letters it has in common with the item immediately before it. Take a SIN squared function, with parameters so that it oscillate sqrt(N) times in the range [1-N]. Insert parametrization into the 2 steps above (add constants, multiply by constant, raise to some power ... whatever seems to give good results once I figure the problem out). Multiply the 2 together to get a "score" Look for a minimum score, it will act as the end of the group. If that helps, here is what it looks like for 100 items (names generated randomly with the below queries), with SIN in blue and the score in black. If I want, e.g. shorter group names at the cost of consistent group sizes, I can tweak the score so that the SIN is not just squared anymore but instead raised to a higher power: that will flatten the troughs of the sin-wave / sharpen the crests. By doing so, more items will be place inside the troughs, therefore I will get more chances that a shorter but off boundary will be preferred over a longer boundary that lands on sqrt(N). Here is a zoomed in version of the 2 charts: Here, the heuristics determins the list should be split around rtetx . The group right before this split will be named [...] - RC and the group after this split will be named RT - [...] . Raising the SIN function to a higher power (note: the software I used to draw the chart interpolates the points into a smooth curve. The curve is only correct on the points, not between them) increases the tolerance, like I said. Now, the group can equally be split around rcEYu or slkkE . Now, the group names will be either: [...] - Q and R - [...] Or [...] - R and S - [...] Now, 1 letter is required for the common boundary of both groups whereas 2 letters were required before. Queries I tried the following query Building the data sample with random data CREATE TABLE Test ( ItemType INTEGER, ItemName VARCHAR ); WITH Chars(chars) AS ( SELECT ARRAY[ 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' ] ) INSERT INTO Test SELECT t, string_agg(chars[random(0, 165)], '') FROM chars CROSS JOIN generate_series(1, 3) t CROSS JOIN generate_series(1, 5) l CROSS JOIN generate_series(1, 400) i GROUP BY t, i; Finding the text to use in group names (query takes < 5sec, suggests it needs a materialized view). WITH Items(ItemType, ItemName) AS ( SELECT ItemType, TRIM(BOTH FROM upper(ItemName)) FROM Test ), stats1(ItemType, NameStarts) AS ( SELECT DISTINCT Items.ItemType, TRIM(BOTH FROM upper(left(ItemName, generate_series(1, 3)))) AS btrim FROM Items ), stats2(ItemType, NameStarts, NameCount) AS ( SELECT s1.ItemType, s1.NameStarts, count(*) FILTER (WHERE left(i.ItemName, length(s1.NameStarts)) <= s1.NameStarts) AS count FROM stats1 s1 JOIN items i ON s1.ItemType = i.ItemType GROUP BY s1.ItemType, s1.NameStarts ), stats3(ItemType, NameStarts, NameCount, totalnamecount) AS ( SELECT stats2.ItemType, min(stats2.NameStarts) AS min, stats2.NameCount, max(stats2.NameCount) OVER (PARTITION BY stats2.ItemType) AS max FROM stats2 GROUP BY stats2.ItemType, stats2.NameCount ), stats4(ItemType, NameStarts, NameCount, score, groupnumber) AS ( SELECT stats3.ItemType, stats3.NameStarts, stats3.NameCount, length(stats3.NameStarts) * (.05 + .95 * power(sin(pi() * stats3.NameCount / sqrt(stats3.totalnamecount)::integer), 2)), ((stats3.NameCount - sqrt(stats3.totalnamecount) / 2) / sqrt(stats3.totalnamecount))::integer AS int4 FROM stats3 WHERE stats3.NameCount > 4 AND (stats3.totalnamecount - stats3.NameCount) > 4 AND stats3.totalnamecount > 50 ) SELECT s1.ItemType, s1.NameStarts, s1.NameCount FROM stats4 s1 JOIN ( SELECT stats4.ItemType, min(stats4.score) AS bestscore, stats4.groupnumber FROM stats4 GROUP BY stats4.ItemType, stats4.groupnumber ) s2 ON s1.ItemType = s2.ItemType AND s1.groupnumber = s2.groupnumber WHERE s1.score = s2.bestscore Problem In the above query, I have added the NameCount column to track how many items belong to 1 group. As the sample data consists in 3 groups of 400 items each, I expect 3 types, each of 20 groups, each group having around 20 items (thus this 3rd columns should show 0, 20, 40, 60, ...). However, while I get the correct rows, I also get additional groups with 1-2 items in it. For instance, I will see: For the first group: NameCount = 20 For the second group: NameCount = 39 For the third group: NameCount = 41 When this happens, that means the function calculates 2 group boundaries with equal value; one of them needs to go. I can't seem to find an easy way to remove those extra records. Has anyone got an idea? Edit: To clarify what I am after (some of which will be a repetition, bear with me): An application displays the result of the query in its UI. It can (and probably will) process the output of the query. Ultimately, my goal is to decrease the amount of scrolling required to view any item in the list by grouping them into a 2-level hierarchy. for N items, groups should have sqrt(N) groups, each with around sqrt(N) items. The number of item in each group depends on the heuristic being strict (aims at sqrt(N)) or tolerant (aims at a compromise between sqrt(N) and the length of the group name). sqrt(N) is not guaranteed to be an integer. For 10000, I expect exactly 100 groups of 100 items. For 10001, I can accept having 100 or 101 groups. If you choose to create 101 groups, I do not mind items being spread onto all those groups. The approach in my query (at least in 1 version before I try to fix its issues) was to ignore the issue completely: 100 items per group in the first 100 group, 1 item in the final group (it is only fair the final group is smaller as it appears last and requires more scrolling to reach than any of the other groups). There are 3 reasons why I currently save data in a materialized view: It is faster for the application. New data is not inserted often enough in the table for it to matter (think about 10 records inserted in a 10k table, that will often not change the group). Users visiting the same list several times in a day will be guaranteed to get the same grouping. It is only after they go to sleep that the grouping will be different in the UI. Because of the delay between data insertion and grouping update, the query does not need to (and probably should not) return both the lower and upper bound of each group. For example, if 2 consecutive groups are A-B and E-F , the application will show contiguous segments, such as A-D and E-F . If it did not, then items starting with C would not be findable.
votes_count
4
comments_count
1
created_at_on_source
2026-05-31T10:30:56.000Z
topics
sqlpostgresql
media / source-specific data
{
  "stackoverflow": {
    "score": 4,
    "view_count": 101,
    "is_answered": false,
    "top_answers": [
      {
        "body": "My first draft has minor performance improvements: reducing joins by using window functions materialise the TRIM(UPPER()) and then index on it The two logical issues it addresses are Avoiding the break-points being close to each other Avoiding selecting two minima in a single \"group\" The first issue is addressed by offsetting everything by \"half a page\", and not breaking in the first or last \"pages\". string old_group old break point new group new break point aaa 0 0 bbb 0 0 ccc 0 0 ddd 0 1 eee 0 yes 1 fff 1 yes 1 yes ggg 1 1 hhh 1 1 iii 1 2 jjj 1 2 (These a total fictional values to illustrate the change.) Old method breaks two pages once each, and yielding three pages, one being tiny: aaa -> eee fff -> fff ggg -> jjj New method breaks one fewer page, and biases the break point the the centre: aaa -> fff ggg -> jjj The second issue is addressed using FIRST_VALUE() instead of MAX() . CREATE TABLE Items ( ItemType INTEGER, ItemName VARCHAR(512), ItemNameUpper VARCHAR(512) GENERATED ALWAYS AS (TRIM(BOTH FROM upper(ItemName))) STORED ); CREATE INDEX ix_test_type_nameUpper ON Items(ItemType, ItemNameUpper); SET LOCAL SEED = 0.5; WITH Chars(chars) AS ( SELECT ARRAY[ 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' ] ) INSERT INTO Items SELECT t, string_agg(chars[random(0, 165)], '') FROM chars CROSS JOIN generate_series(1, 3) t CROSS JOIN generate_series(1, 5) l CROSS JOIN generate_series(1, 400) i GROUP BY t, i; WITH stats1(ItemType, NameStarts, len, NameCount) AS ( SELECT Items.ItemType, nameStarts, i, SUM(COUNT(*)) OVER (PARTITION BY ItemType, i ORDER BY NameStarts) FROM Items CROSS JOIN generate_series(1, 3) AS i CROSS JOIN LATERAL ( SELECT LEFT(ItemNameUpper, i) ) AS name_starts(nameStarts) GROUP BY Items.ItemType, nameStarts, i ), stats3(ItemType, NameStarts, NameCount, totalnamecount) AS ( SELECT ItemType, MIN(NameStarts), NameCount, MAX(NameCount) OVER (PARTITION BY ItemType) FROM stats1 GROUP BY ItemType, NameCount ), stats4(ItemType, NameStarts, NameCount, naive_group_id, score, new_group_marker, min_score) AS ( SELECT stats3.ItemType, stats3.NameStarts, stats3.NameCount, page.id, page.score, CASE WHEN page.id IN (0, totalnamecount / group_size) THEN 0 WHEN NameCount = FIRST_VALUE(NameCount) OVER (PARTITION BY ItemType, page.id ORDER BY page.score) THEN 1 ELSE 0 END, MIN(page.score) OVER (PARTITION BY ItemType, page.id) FROM stats3 CROSS JOIN LATERAL ( SELECT CASE WHEN SQRT(stats3.totalnamecount) < 20 THEN stats3.totalnamecount ELSE SQRT(stats3.totalnamecount)::int END ) AS group_size(group_size) CROSS JOIN LATERAL ( SELECT (stats3.NameCount / (group_size / 2) + 0.5)::int >> 1, (LENGTH(stats3.NameStarts) + 3) * (1 + POWER(sin(pi() * stats3.NameCount / group_size), 2)) ) AS page(id, score) ), debug_data AS ( SELECT itemType, nameStarts, nameCount, naive_Group_id, score, min_score, SUM(new_group_marker) OVER (PARTITION BY ItemType ORDER BY namecount) AS group_id FROM Stats4 ORDER BY 1,3,2,4 ) --SELECT * FROM debug_data WHERE ItemType = 2 ORDER BY ItemType, namecount SELECT ItemType, group_id, MIN(namestarts), MAX(namestarts), COUNT(*) AS group_size FROM debug_data GROUP BY itemType, group_id ORDER BY 1, 2 ; itemtype group_id min max group_size 1 0 0 7K 19 1 1 8 AYC 18 1 2 A CX 21 1 3 C DYM 17 1 4 D FGP 23 1 5 F HEH 20 1 6 HF HZE 17 1 7 H JEJ 23 1 8 J KY 20 1 9 K LYP 21 1 10 L MW 17 1 11 M OK 22 1 12 O PJ 20 1 13 P QSC 17 1 14 Q SDO 22 1 15 S TGJ 21 1 16 T UY 19 1 17 U WI 20 1 18 W XT 17 1 19 X ZZF 23 2 0 0 ADL 20 2 1 A BA4 19 2 2 BA BV 20 2 3 B DHM 19 2 4 D EJF 20 2 5 E FT 16 2 6 F HM 21 2 7 H IZH 22 2 8 I JY 16 2 9 J LOA 22 2 10 L MP 20 2 11 M OMB 20 2 12 O PFO 20 2 13 P QYF 21 2 14 Q SO 22 2 15 S UGS 17 2 16 U VR 19 2 17 V XH 21 2 18 X YJB 20 2 19 Y ZZL 21 3 0 1 8V 21 3 1 8 BD8 17 3 2 B CV 21 3 3 C DY 17 3 4 D FP 22 3 5 F GX 17 3 6 G IT 25 3 7 I JT 15 3 8 J LW 24 3 9 L NJ 19 3 10 N OW 18 3 11 O QKC 23 3 12 Q ROB 19 3 13 R SR 19 3 14 S TV 21 3 15 T UXI 18 3 16 U VX 18 3 17 V XV 24 3 18 X YW 17 3 19 Y ZR 21 fiddle Notes: I've changed the formatting to suit myself, that's optional. I've also started replacing stat1 , stat2 , stat3 with useful naming, that should never be optional, it made it very hard to initially decipher your code. I change the scoring function to exaggerate the variable group sizes, that should be easy to change to anything else. I'm certain that this can be optimised significantly further, but I stopped when you commented that you don't care about that.",
        "score": 0,
        "answer_id": 79949330,
        "is_accepted": false
      }
    ],
    "answer_count": 1,
    "accepted_answer_id": null,
    "last_activity_date": 1780264651
  }
}
raw_payload
{
  "stats": {
    "score": 4,
    "view_count": 101,
    "is_answered": false,
    "answer_count": 1,
    "creation_date": 1780223456,
    "last_edit_date": 1780264651,
    "accepted_answer_id": null,
    "last_activity_date": 1780264651
  },
  "api_wrapper": {
    "backoff": null,
    "has_more": true,
    "page_size": 8,
    "quota_max": 300,
    "quota_remaining": 248
  },
  "question_id": 79949208,
  "answer_fetch": {
    "has_more": false,
    "answers_fetched": 1,
    "answer_page_size": 3
  },
  "snapshot_version": "stackoverflow_question_v1"
}
source_raw_snapshot
{
  "id": "a3c8ca4a-52d4-4e05-b6bb-873318adaf23",
  "daily_ranking_item_id": "29f93f68-882a-44a8-b55e-4d99cd33608b",
  "source": "stackoverflow",
  "external_id": "79949208",
  "fetched_at": "2026-05-31T22:01:57.241Z",
  "question_raw": {
    "body": "<p>I have a table containing a lot of typed/named items and I want to create a (materialized?) view to group them together so that a UI can later make it easier for users to select.</p>\n<h1>Description</h1>\n<h2>The goal is:</h2>\n<p>If there are N elements of the same type, make sqrt(N) groups each of size sqrt(N). Instead of having e.g. 10k elements to scroll through, the UI would show 100 lines (the groups), each of which can be expanded into 100 lines (the items inside each group). 1 more click to make but a lot less scrolling.</p>\n<p>The UI will display group names as the range containing the named elements inside it.<br />\nFor instance, if a group contains all the items whose names start with A and B (but not C and beyond), then the group should be named <code>A-B</code>.<br />\nIf a group contains all the elements starting with C until e.g. EG, but elements starting with EH are the the next group, the group should be name <code>C-EG</code>.</p>\n<h2>The idea is:</h2>\n<p>Instead of comparing the texts and deciding where the lists should be split into groups, I am using a function to score the items. Why I do that will make sense in a minute.</p>\n<p>Anyway, a heuristics turns that whole mess into a relatively usual problem to solve, that is searching for the local minimums of a function.</p>\n<h2>Here is the twist (that explains why I am building a function):</h2>\n<p>I could make it so that each group has exactly the same size (ignoring the rounding of sqrt(N) to the nearest integer) but instead, I want to find a compromise between group size and length of the group name. As groups are named with the pattern <code>&lt;start of group&gt; - &lt;end of group&gt;</code>, I ideally want <code>&lt;start of group&gt;</code> and <code>&lt;end of group&gt;</code> to be of length 3 at most.<br/>\nSince I am fine with counts being &quot;reasonably&quot; different from sqrt(N), I am also fine with them being slightly out of date. My idea therefore is to feed the result of this query into a materialized view that is refreshed e.g. every night.</p>\n<p>The function works that way:</p>\n<ol>\n<li>For each item, count how many first letters it has in common with the item immediately before it.</li>\n<li>Take a SIN squared function, with parameters so that it oscillate sqrt(N) times in the range [1-N].</li>\n<li>Insert parametrization into the 2 steps above (add constants, multiply by constant, raise to some power ... whatever seems to give good results once I figure the problem out).</li>\n<li>Multiply the 2 together to get a &quot;score&quot;</li>\n<li>Look for a minimum score, it will act as the end of the group.</li>\n</ol>\n<p>If that helps, here is what it looks like for 100 items (names generated randomly with the below queries), with SIN in blue and the score in black.<img src=\"https://i.sstatic.net/M6vC8WTp.png\" alt=\"Sin on 100 items\" /></p>\n<p>If I want, e.g. shorter group names at the cost of consistent group sizes, I can tweak the score so that the SIN is not just squared anymore but instead raised to a higher power: that will flatten the troughs of the sin-wave / sharpen the crests.<br/>\nBy doing so, more items will be place inside the troughs, therefore I will get more chances that a shorter but off boundary will be preferred over a longer boundary that lands on sqrt(N).\n<img src=\"https://i.sstatic.net/g6QQkiIz.png\" alt=\"Another power\" /></p>\n<p>Here is a zoomed in version of the 2 charts:</p>\n<p><a href=\"https://i.sstatic.net/jqCzP0Fd.png\" rel=\"nofollow noreferrer\"><img src=\"https://i.sstatic.net/jqCzP0Fd.png\" alt=\"Zoom 1\" /></a>\nHere, the heuristics determins the list should be split around <code>rtetx</code>.\nThe group right before this split will be named <code>[...] - RC</code> and the group after this split will be named <code>RT - [...]</code>.</p>\n<p><a href=\"https://i.sstatic.net/4h5sfzRL.png\" rel=\"nofollow noreferrer\"><img src=\"https://i.sstatic.net/4h5sfzRL.png\" alt=\"Zoom 2\" /></a>\nRaising the SIN function to a higher power (note: the software I used to draw the chart interpolates the points into a smooth curve. The curve is only correct on the points, not between them) increases the tolerance, like I said. Now, the group can equally be split around <code>rcEYu</code> or <code>slkkE</code>.\nNow, the group names will be either:</p>\n<ul>\n<li><code>[...] - Q</code> and <code>R - [...]</code></li>\n<li>Or <code>[...] - R</code> and <code>S - [...]</code></li>\n</ul>\n<p>Now, 1 letter is required for the common boundary of both groups whereas 2 letters were required before.</p>\n<hr />\n<h1>Queries</h1>\n<p>I tried the following query</p>\n<ol>\n<li>Building the data sample with random data</li>\n</ol>\n<pre><code>CREATE TABLE Test (\n ItemType INTEGER,\n ItemName VARCHAR\n);\n\nWITH Chars(chars) AS (\n SELECT ARRAY[\n     'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',\n     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',\n     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',\n     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',\n     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',\n     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',\n     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'\n ]\n)\nINSERT INTO Test\nSELECT t, string_agg(chars[random(0, 165)], '')\nFROM chars\nCROSS JOIN generate_series(1, 3) t\nCROSS JOIN generate_series(1, 5) l\nCROSS JOIN generate_series(1, 400) i\nGROUP BY t, i;\n</code></pre>\n<ol start=\"2\">\n<li>Finding the text to use in group names (query takes &lt; 5sec, suggests it needs a materialized view).</li>\n</ol>\n<pre><code>WITH Items(ItemType, ItemName) AS (\n SELECT ItemType, TRIM(BOTH FROM upper(ItemName)) FROM Test\n), stats1(ItemType, NameStarts) AS (\n SELECT DISTINCT Items.ItemType, TRIM(BOTH FROM upper(left(ItemName, generate_series(1, 3)))) AS btrim\n FROM Items\n ), stats2(ItemType, NameStarts, NameCount) AS (\nSELECT s1.ItemType, s1.NameStarts,\n count(*) FILTER (WHERE left(i.ItemName, length(s1.NameStarts)) &lt;= s1.NameStarts) AS count\n FROM stats1 s1\n JOIN items i ON s1.ItemType = i.ItemType\n GROUP BY s1.ItemType, s1.NameStarts\n), stats3(ItemType, NameStarts, NameCount, totalnamecount) AS (\nSELECT stats2.ItemType,\n min(stats2.NameStarts) AS min,\n stats2.NameCount,\n max(stats2.NameCount) OVER (PARTITION BY stats2.ItemType) AS max\n FROM stats2\n GROUP BY stats2.ItemType, stats2.NameCount\n), stats4(ItemType, NameStarts, NameCount, score, groupnumber) AS (\n SELECT stats3.ItemType,\n stats3.NameStarts,\n stats3.NameCount,\n     length(stats3.NameStarts) * (.05 + .95 * power(sin(pi() * stats3.NameCount / sqrt(stats3.totalnamecount)::integer), 2)),\n ((stats3.NameCount - sqrt(stats3.totalnamecount) / 2) / sqrt(stats3.totalnamecount))::integer AS int4\n FROM stats3\n WHERE stats3.NameCount &gt; 4 AND (stats3.totalnamecount - stats3.NameCount) &gt; 4 AND stats3.totalnamecount &gt; 50\n)\nSELECT s1.ItemType, s1.NameStarts, s1.NameCount\nFROM stats4 s1\nJOIN (\n SELECT stats4.ItemType, min(stats4.score) AS bestscore, stats4.groupnumber\n FROM stats4\n GROUP BY stats4.ItemType, stats4.groupnumber\n) s2 ON s1.ItemType = s2.ItemType AND s1.groupnumber = s2.groupnumber\nWHERE s1.score = s2.bestscore\n</code></pre>\n<hr />\n<h1>Problem</h1>\n<p>In the above query, I have added the <code>NameCount</code> column to track how many items belong to 1 group. As the sample data consists in 3 groups of 400 items each, I expect 3 types, each of 20 groups, each group having around 20 items (thus this 3rd columns should show 0, 20, 40, 60, ...).<br/>\nHowever, while I get the correct rows, I also get additional groups with 1-2 items in it. For instance, I will see:</p>\n<ul>\n<li>For the first group: <code>NameCount</code> = 20</li>\n<li>For the second group: <code>NameCount</code> = 39</li>\n<li>For the third group: <code>NameCount</code> = 41</li>\n</ul>\n<p>When this happens, that means the function calculates 2 group boundaries with equal value; one of them needs to go.</p>\n<p>I can't seem to find an easy way to remove those extra records. Has anyone got an idea?</p>\n<hr />\n<p>Edit: To clarify what I am after (some of which will be a repetition, bear with me):</p>\n<ol>\n<li>An application displays the result of the query in its UI. It can (and probably will) process the output of the query.</li>\n<li>Ultimately, my goal is to decrease the amount of scrolling required to view any item in the list by grouping them into a 2-level hierarchy.</li>\n<li>for N items, groups should have sqrt(N) groups, each with around sqrt(N) items. The number of item in each group depends on the heuristic being strict (aims at sqrt(N)) or tolerant (aims at a compromise between sqrt(N) and the length of the group name).</li>\n<li>sqrt(N) is not guaranteed to be an integer.<br/>For 10000, I expect exactly 100 groups of 100 items.<br/>For 10001, I can accept having 100 or 101 groups. If you choose to create 101 groups, I do not mind items being spread onto all those groups.<br/>The approach in my query (at least in 1 version before I try to fix its issues) was to ignore the issue completely: 100 items per group in the first 100 group, 1 item in the final group (it is only fair the final group is smaller as it appears last and requires more scrolling to reach than any of the other groups).</li>\n<li>There are 3 reasons why I currently save data in a materialized view:\n<ol>\n<li>It is faster for the application.</li>\n<li>New data is not inserted often enough in the table for it to matter (think about 10 records inserted in a 10k table, that will often not change the group).</li>\n<li>Users visiting the same list several times in a day will be guaranteed to get the same grouping. It is only after they go to sleep that the grouping will be different in the UI.</li>\n</ol>\n</li>\n<li>Because of the delay between data insertion and grouping update, the query does not need to (and probably should not) return both the lower and upper bound of each group.<br/>\nFor example, if 2 consecutive groups are <code>A-B</code> and <code>E-F</code>, the application will show contiguous segments, such as <code>A-D</code> and <code>E-F</code>. If it did not, then items starting with <code>C</code> would not be findable.</li>\n</ol>\n",
    "link": "https://stackoverflow.com/questions/79949208/grouping-text-rows-in-equal-sized-groups",
    "tags": [
      "sql",
      "postgresql"
    ],
    "owner": {
      "link": "https://stackoverflow.com/users/20143744/atmo",
      "user_id": 20143744,
      "user_type": "registered",
      "account_id": 26507904,
      "reputation": 4276,
      "display_name": "Atmo",
      "profile_image": "https://www.gravatar.com/avatar/633cbcb984c15d92bf1decb4354fb62b?s=256&d=identicon&r=PG"
    },
    "score": 4,
    "title": "Grouping text rows in equal-sized groups",
    "view_count": 101,
    "is_answered": false,
    "question_id": 79949208,
    "answer_count": 1,
    "creation_date": 1780223456,
    "last_edit_date": 1780264651,
    "content_license": "CC BY-SA 4.0",
    "last_activity_date": 1780264651
  },
  "answers_raw": [
    {
      "body": "<p>My first draft has minor performance improvements:</p>\n<ul>\n<li>reducing joins by using window functions</li>\n<li>materialise the <code>TRIM(UPPER())</code> and then index on it</li>\n</ul>\n<p>The two logical issues it addresses are</p>\n<ul>\n<li>Avoiding the break-points being close to each other</li>\n<li>Avoiding selecting two minima in a single &quot;group&quot;</li>\n</ul>\n<p><em><strong>The first issue</strong></em> is addressed by offsetting everything by &quot;half a page&quot;, and not breaking in the first or last &quot;pages&quot;.</p>\n<div class=\"s-table-container\"><table class=\"s-table\">\n<thead>\n<tr>\n<th>string</th>\n<th>old_group</th>\n<th>old break point</th>\n<th>new group</th>\n<th>new break point</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>aaa</td>\n<td>0</td>\n<td></td>\n<td>0</td>\n<td></td>\n</tr>\n<tr>\n<td>bbb</td>\n<td>0</td>\n<td></td>\n<td>0</td>\n<td></td>\n</tr>\n<tr>\n<td>ccc</td>\n<td>0</td>\n<td></td>\n<td>0</td>\n<td></td>\n</tr>\n<tr>\n<td>ddd</td>\n<td>0</td>\n<td></td>\n<td>1</td>\n<td></td>\n</tr>\n<tr>\n<td>eee</td>\n<td>0</td>\n<td>yes</td>\n<td>1</td>\n<td></td>\n</tr>\n<tr>\n<td>fff</td>\n<td>1</td>\n<td>yes</td>\n<td>1</td>\n<td>yes</td>\n</tr>\n<tr>\n<td>ggg</td>\n<td>1</td>\n<td></td>\n<td>1</td>\n<td></td>\n</tr>\n<tr>\n<td>hhh</td>\n<td>1</td>\n<td></td>\n<td>1</td>\n<td></td>\n</tr>\n<tr>\n<td>iii</td>\n<td>1</td>\n<td></td>\n<td>2</td>\n<td></td>\n</tr>\n<tr>\n<td>jjj</td>\n<td>1</td>\n<td></td>\n<td>2</td>\n<td></td>\n</tr>\n</tbody>\n</table></div>\n<p>(These a total fictional values to illustrate the change.)</p>\n<p>Old method breaks two pages once each, and yielding three pages, one being tiny:</p>\n<ul>\n<li>aaa -&gt; eee</li>\n<li>fff -&gt; fff</li>\n<li>ggg -&gt; jjj</li>\n</ul>\n<p>New method breaks one fewer page, and biases the break point the the centre:</p>\n<ul>\n<li>aaa -&gt; fff</li>\n<li>ggg -&gt; jjj</li>\n</ul>\n<p><em><strong>The second issue</strong></em> is addressed using <code>FIRST_VALUE()</code> instead of <code>MAX()</code>.</p>\n<pre><code>CREATE TABLE Items (\n  ItemType INTEGER,\n  ItemName VARCHAR(512),\n  ItemNameUpper VARCHAR(512) GENERATED ALWAYS AS (TRIM(BOTH FROM upper(ItemName))) STORED\n);\nCREATE INDEX ix_test_type_nameUpper ON Items(ItemType, ItemNameUpper);\n\nSET LOCAL SEED = 0.5;\n\nWITH Chars(chars) AS (\n SELECT ARRAY[\n     'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',\n     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',\n     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',\n     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',\n     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',\n     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',\n     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'\n ]\n)\nINSERT INTO Items\nSELECT t, string_agg(chars[random(0, 165)], '')\nFROM chars\nCROSS JOIN generate_series(1, 3) t\nCROSS JOIN generate_series(1, 5) l\nCROSS JOIN generate_series(1, 400) i\nGROUP BY t, i;\n</code></pre>\n<pre><code>WITH\n  stats1(ItemType, NameStarts, len, NameCount)\nAS\n(\n  SELECT\n    Items.ItemType,\n    nameStarts,\n    i,\n    SUM(COUNT(*)) OVER (PARTITION BY ItemType, i ORDER BY NameStarts)\n  FROM\n    Items\n  CROSS JOIN\n    generate_series(1, 3)  AS i\n  CROSS JOIN LATERAL\n  (\n    SELECT LEFT(ItemNameUpper, i)\n  )\n    AS name_starts(nameStarts)\n  GROUP BY\n    Items.ItemType,\n    nameStarts,\n    i    \n),\n   stats3(ItemType, NameStarts, NameCount, totalnamecount)\nAS\n(\n  SELECT\n    ItemType,\n    MIN(NameStarts),\n    NameCount,\n    MAX(NameCount) OVER (PARTITION BY ItemType)\n FROM\n    stats1\n GROUP BY\n    ItemType,\n    NameCount\n),\n  stats4(ItemType, NameStarts, NameCount, naive_group_id, score, new_group_marker, min_score)\nAS\n(\n  SELECT\n    stats3.ItemType,\n    stats3.NameStarts,\n    stats3.NameCount,\n    page.id,\n    page.score,\n    CASE\n      WHEN page.id IN (0, totalnamecount / group_size)                                                  THEN 0\n      WHEN NameCount = FIRST_VALUE(NameCount) OVER (PARTITION BY ItemType, page.id ORDER BY page.score) THEN 1\n                                                                                                        ELSE 0\n    END,\n    MIN(page.score) OVER (PARTITION BY ItemType, page.id)\n  FROM\n    stats3\n  CROSS JOIN LATERAL\n  (\n    SELECT\n      CASE\n        WHEN SQRT(stats3.totalnamecount) &lt; 20\n        THEN stats3.totalnamecount\n        ELSE SQRT(stats3.totalnamecount)::int\n     END\n  )\n    AS group_size(group_size)\n  CROSS JOIN LATERAL\n  (\n    SELECT\n      (stats3.NameCount / (group_size / 2) + 0.5)::int &gt;&gt; 1,\n      (LENGTH(stats3.NameStarts) + 3) * (1 + POWER(sin(pi() * stats3.NameCount / group_size), 2))\n  )\n    AS page(id, score)\n),\n  debug_data\nAS\n(\n  SELECT\n    itemType,\n    nameStarts,\n    nameCount,\n    naive_Group_id,\n    score,\n    min_score,\n    SUM(new_group_marker) OVER (PARTITION BY ItemType ORDER BY namecount)  AS group_id\n  FROM\n    Stats4\n  ORDER BY\n    1,3,2,4\n)\n--SELECT * FROM debug_data WHERE ItemType = 2 ORDER BY ItemType, namecount\nSELECT ItemType, group_id, MIN(namestarts), MAX(namestarts), COUNT(*) AS group_size FROM debug_data GROUP BY itemType, group_id ORDER BY 1, 2\n;\n</code></pre>\n<div class=\"s-table-container\"><table class=\"s-table\">\n<thead>\n<tr>\n<th style=\"text-align: right;\">itemtype</th>\n<th style=\"text-align: right;\">group_id</th>\n<th style=\"text-align: left;\">min</th>\n<th style=\"text-align: left;\">max</th>\n<th style=\"text-align: right;\">group_size</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">0</td>\n<td style=\"text-align: left;\">0</td>\n<td style=\"text-align: left;\">7K</td>\n<td style=\"text-align: right;\">19</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: left;\">8</td>\n<td style=\"text-align: left;\">AYC</td>\n<td style=\"text-align: right;\">18</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: left;\">A</td>\n<td style=\"text-align: left;\">CX</td>\n<td style=\"text-align: right;\">21</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: left;\">C</td>\n<td style=\"text-align: left;\">DYM</td>\n<td style=\"text-align: right;\">17</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">4</td>\n<td style=\"text-align: left;\">D</td>\n<td style=\"text-align: left;\">FGP</td>\n<td style=\"text-align: right;\">23</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">5</td>\n<td style=\"text-align: left;\">F</td>\n<td style=\"text-align: left;\">HEH</td>\n<td style=\"text-align: right;\">20</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">6</td>\n<td style=\"text-align: left;\">HF</td>\n<td style=\"text-align: left;\">HZE</td>\n<td style=\"text-align: right;\">17</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">7</td>\n<td style=\"text-align: left;\">H</td>\n<td style=\"text-align: left;\">JEJ</td>\n<td style=\"text-align: right;\">23</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">8</td>\n<td style=\"text-align: left;\">J</td>\n<td style=\"text-align: left;\">KY</td>\n<td style=\"text-align: right;\">20</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">9</td>\n<td style=\"text-align: left;\">K</td>\n<td style=\"text-align: left;\">LYP</td>\n<td style=\"text-align: right;\">21</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">10</td>\n<td style=\"text-align: left;\">L</td>\n<td style=\"text-align: left;\">MW</td>\n<td style=\"text-align: right;\">17</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">11</td>\n<td style=\"text-align: left;\">M</td>\n<td style=\"text-align: left;\">OK</td>\n<td style=\"text-align: right;\">22</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">12</td>\n<td style=\"text-align: left;\">O</td>\n<td style=\"text-align: left;\">PJ</td>\n<td style=\"text-align: right;\">20</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">13</td>\n<td style=\"text-align: left;\">P</td>\n<td style=\"text-align: left;\">QSC</td>\n<td style=\"text-align: right;\">17</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">14</td>\n<td style=\"text-align: left;\">Q</td>\n<td style=\"text-align: left;\">SDO</td>\n<td style=\"text-align: right;\">22</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">15</td>\n<td style=\"text-align: left;\">S</td>\n<td style=\"text-align: left;\">TGJ</td>\n<td style=\"text-align: right;\">21</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">16</td>\n<td style=\"text-align: left;\">T</td>\n<td style=\"text-align: left;\">UY</td>\n<td style=\"text-align: right;\">19</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">17</td>\n<td style=\"text-align: left;\">U</td>\n<td style=\"text-align: left;\">WI</td>\n<td style=\"text-align: right;\">20</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">18</td>\n<td style=\"text-align: left;\">W</td>\n<td style=\"text-align: left;\">XT</td>\n<td style=\"text-align: right;\">17</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: right;\">19</td>\n<td style=\"text-align: left;\">X</td>\n<td style=\"text-align: left;\">ZZF</td>\n<td style=\"text-align: right;\">23</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">0</td>\n<td style=\"text-align: left;\">0</td>\n<td style=\"text-align: left;\">ADL</td>\n<td style=\"text-align: right;\">20</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: left;\">A</td>\n<td style=\"text-align: left;\">BA4</td>\n<td style=\"text-align: right;\">19</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: left;\">BA</td>\n<td style=\"text-align: left;\">BV</td>\n<td style=\"text-align: right;\">20</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: left;\">B</td>\n<td style=\"text-align: left;\">DHM</td>\n<td style=\"text-align: right;\">19</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">4</td>\n<td style=\"text-align: left;\">D</td>\n<td style=\"text-align: left;\">EJF</td>\n<td style=\"text-align: right;\">20</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">5</td>\n<td style=\"text-align: left;\">E</td>\n<td style=\"text-align: left;\">FT</td>\n<td style=\"text-align: right;\">16</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">6</td>\n<td style=\"text-align: left;\">F</td>\n<td style=\"text-align: left;\">HM</td>\n<td style=\"text-align: right;\">21</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">7</td>\n<td style=\"text-align: left;\">H</td>\n<td style=\"text-align: left;\">IZH</td>\n<td style=\"text-align: right;\">22</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">8</td>\n<td style=\"text-align: left;\">I</td>\n<td style=\"text-align: left;\">JY</td>\n<td style=\"text-align: right;\">16</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">9</td>\n<td style=\"text-align: left;\">J</td>\n<td style=\"text-align: left;\">LOA</td>\n<td style=\"text-align: right;\">22</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">10</td>\n<td style=\"text-align: left;\">L</td>\n<td style=\"text-align: left;\">MP</td>\n<td style=\"text-align: right;\">20</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">11</td>\n<td style=\"text-align: left;\">M</td>\n<td style=\"text-align: left;\">OMB</td>\n<td style=\"text-align: right;\">20</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">12</td>\n<td style=\"text-align: left;\">O</td>\n<td style=\"text-align: left;\">PFO</td>\n<td style=\"text-align: right;\">20</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">13</td>\n<td style=\"text-align: left;\">P</td>\n<td style=\"text-align: left;\">QYF</td>\n<td style=\"text-align: right;\">21</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">14</td>\n<td style=\"text-align: left;\">Q</td>\n<td style=\"text-align: left;\">SO</td>\n<td style=\"text-align: right;\">22</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">15</td>\n<td style=\"text-align: left;\">S</td>\n<td style=\"text-align: left;\">UGS</td>\n<td style=\"text-align: right;\">17</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">16</td>\n<td style=\"text-align: left;\">U</td>\n<td style=\"text-align: left;\">VR</td>\n<td style=\"text-align: right;\">19</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">17</td>\n<td style=\"text-align: left;\">V</td>\n<td style=\"text-align: left;\">XH</td>\n<td style=\"text-align: right;\">21</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">18</td>\n<td style=\"text-align: left;\">X</td>\n<td style=\"text-align: left;\">YJB</td>\n<td style=\"text-align: right;\">20</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: right;\">19</td>\n<td style=\"text-align: left;\">Y</td>\n<td style=\"text-align: left;\">ZZL</td>\n<td style=\"text-align: right;\">21</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">0</td>\n<td style=\"text-align: left;\">1</td>\n<td style=\"text-align: left;\">8V</td>\n<td style=\"text-align: right;\">21</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">1</td>\n<td style=\"text-align: left;\">8</td>\n<td style=\"text-align: left;\">BD8</td>\n<td style=\"text-align: right;\">17</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">2</td>\n<td style=\"text-align: left;\">B</td>\n<td style=\"text-align: left;\">CV</td>\n<td style=\"text-align: right;\">21</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: left;\">C</td>\n<td style=\"text-align: left;\">DY</td>\n<td style=\"text-align: right;\">17</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">4</td>\n<td style=\"text-align: left;\">D</td>\n<td style=\"text-align: left;\">FP</td>\n<td style=\"text-align: right;\">22</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">5</td>\n<td style=\"text-align: left;\">F</td>\n<td style=\"text-align: left;\">GX</td>\n<td style=\"text-align: right;\">17</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">6</td>\n<td style=\"text-align: left;\">G</td>\n<td style=\"text-align: left;\">IT</td>\n<td style=\"text-align: right;\">25</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">7</td>\n<td style=\"text-align: left;\">I</td>\n<td style=\"text-align: left;\">JT</td>\n<td style=\"text-align: right;\">15</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">8</td>\n<td style=\"text-align: left;\">J</td>\n<td style=\"text-align: left;\">LW</td>\n<td style=\"text-align: right;\">24</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">9</td>\n<td style=\"text-align: left;\">L</td>\n<td style=\"text-align: left;\">NJ</td>\n<td style=\"text-align: right;\">19</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">10</td>\n<td style=\"text-align: left;\">N</td>\n<td style=\"text-align: left;\">OW</td>\n<td style=\"text-align: right;\">18</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">11</td>\n<td style=\"text-align: left;\">O</td>\n<td style=\"text-align: left;\">QKC</td>\n<td style=\"text-align: right;\">23</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">12</td>\n<td style=\"text-align: left;\">Q</td>\n<td style=\"text-align: left;\">ROB</td>\n<td style=\"text-align: right;\">19</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">13</td>\n<td style=\"text-align: left;\">R</td>\n<td style=\"text-align: left;\">SR</td>\n<td style=\"text-align: right;\">19</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">14</td>\n<td style=\"text-align: left;\">S</td>\n<td style=\"text-align: left;\">TV</td>\n<td style=\"text-align: right;\">21</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">15</td>\n<td style=\"text-align: left;\">T</td>\n<td style=\"text-align: left;\">UXI</td>\n<td style=\"text-align: right;\">18</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">16</td>\n<td style=\"text-align: left;\">U</td>\n<td style=\"text-align: left;\">VX</td>\n<td style=\"text-align: right;\">18</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">17</td>\n<td style=\"text-align: left;\">V</td>\n<td style=\"text-align: left;\">XV</td>\n<td style=\"text-align: right;\">24</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">18</td>\n<td style=\"text-align: left;\">X</td>\n<td style=\"text-align: left;\">YW</td>\n<td style=\"text-align: right;\">17</td>\n</tr>\n<tr>\n<td style=\"text-align: right;\">3</td>\n<td style=\"text-align: right;\">19</td>\n<td style=\"text-align: left;\">Y</td>\n<td style=\"text-align: left;\">ZR</td>\n<td style=\"text-align: right;\">21</td>\n</tr>\n</tbody>\n</table></div>\n<p><a href=\"https://dbfiddle.uk/FsKwZcex\" rel=\"nofollow noreferrer\">fiddle</a></p>\n<p><em><strong>Notes:</strong></em></p>\n<ul>\n<li>I've changed the formatting to suit myself, that's optional.</li>\n<li>I've also started replacing <code>stat1</code>, <code>stat2</code>, <code>stat3</code> with useful naming, that should never be optional, it made it very hard to initially decipher your code.</li>\n<li>I change the scoring function to exaggerate the variable group sizes, that should be easy to change to anything else.</li>\n</ul>\n<p>I'm <em><strong>certain</strong></em> that this can be optimised significantly further, but I stopped when you commented that you don't care about that.</p>\n",
      "owner": {
        "link": "https://stackoverflow.com/users/53341/matbailie",
        "user_id": 53341,
        "user_type": "registered",
        "account_id": 21819,
        "reputation": 87950,
        "accept_rate": 93,
        "display_name": "MatBailie",
        "profile_image": "https://i.sstatic.net/98geJ.png?s=256"
      },
      "score": 0,
      "answer_id": 79949330,
      "is_accepted": false,
      "question_id": 79949208,
      "creation_date": 1780245859,
      "content_license": "CC BY-SA 4.0",
      "last_activity_date": 1780245859
    }
  ],
  "tags_raw": [
    "sql",
    "postgresql"
  ],
  "stats_raw": {
    "score": 4,
    "view_count": 101,
    "is_answered": false,
    "answer_count": 1,
    "creation_date": 1780223456,
    "last_edit_date": 1780264651,
    "accepted_answer_id": null,
    "last_activity_date": 1780264651
  },
  "selection_meta": {
    "site": "stackoverflow",
    "api_wrapper": {
      "backoff": null,
      "has_more": true,
      "page_size": 8,
      "quota_max": 300,
      "quota_remaining": 248
    },
    "answer_fetch": {
      "backoff": null,
      "has_more": false,
      "answers_fetched": 1,
      "quota_remaining": 219,
      "answer_page_size": 3
    },
    "snapshot_version": "stackoverflow_question_v1",
    "selection_strategy": "tag_whitelist_unanswered_high_score_recent_active"
  },
  "created_at": "2026-05-31T22:01:57.450Z",
  "updated_at": "2026-05-31T22:01:57.450Z"
}