Burning in Greenplum cluster servers

Posted by scottk on Feb 3, 2011 in Ramblings |
In order to do some benchmarking for a Greenplum cluster I’ve modified the gensort program to generate a repeatable set of data that I can use to mimic impression and click data. I managed to put this to the test recently as we were breaking in some new hardware. The idea being to get the nodes in a cluster to generate the data and then load them into the db and do some basic lookups on the data to get some stats. In this configuration I’m running 16 nodes (servers) with 8 primary segments a piece, for a total of 128 segments.
[gpadmin@mdw ~]$
Timing is on.
STARTING RUN OF BIG DATA GEN | 2011-01-25 07:54:56.393134-06
These Greenplum external tables are actually calls to three programs for each dataset I’m doing.
These tables when called execute programs which
  1. Create data file with my data gen program
  2. Access data via cat (not the feline) for data importing
  3. Delete the file
START CREATE EXTERNAL TABLES | 2011-01-25 07:54:56.393928-06
CREATE EXTERNAL TABLE
Time: 749.494 ms
CREATE EXTERNAL TABLE
Time: 299.153 ms
CREATE EXTERNAL TABLE
Time: 255.646 ms
CREATE EXTERNAL TABLE
Time: 173.839 ms
CREATE EXTERNAL TABLE
Time: 175.165 ms
CREATE EXTERNAL TABLE
Time: 135.464 ms
FINISH CREATE EXTERNAL TABLES | 2011-01-25 07:54:58.18323-06
Next I create some tables.
The first two are basic heap tables and used to load the data and do some basic ETL from.
The next two are my impression and click tables, these are partitioned by day (that’s 365 partitions * 128 segments).
The last table is a profile table for further transaction testing that I haven’t made use of yet.
START CREATE NORMAL TABLES | 2011-01-25 07:54:58.184032-06
CREATE TABLE
Time: 381.024 ms
CREATE TABLE
Time: 300.298 ms
psql:big_data_test.sql:41: NOTICE:  CREATE TABLE will create partition “impression_1_prt_1″ for table “impression”
psql:big_data_test.sql:41: NOTICE:  CREATE TABLE will create partition “impression_1_prt_365″ for table “impression”
CREATE TABLE
Time: 53501.465 ms (53 seconds)
psql:big_data_test.sql:57: NOTICE:  CREATE TABLE will create partition “click_1_prt_1″ for table “click”
psql:big_data_test.sql:57: NOTICE:  CREATE TABLE will create partition “click_1_prt_365″ for table “click”
CREATE TABLE
Time: 60088.324 ms
CREATE TABLE
Time: 252.770 ms
FINISH CREATE NORMAL TABLES | 2011-01-25 07:56:52.70869-06
Next up start generating the data files
Data is of the form:
YYYSMJYHZFQCRXCB        JBBZ    00000000000000000000000000000000        0000222200002222000022220000222200002222000000001111    2010-07-05 22:53:27
START CREATE DATA FILES | 2011-01-25 07:56:52.709458-06
(creating 12,800,000,000 impression records)
result_text
————-
Time: 457891.573 ms (~7.5 minutes)
(same generation but only outputting 800,012,931 click records)
result_text
————-
Time: 447864.175 ms
FINISH CREATE DATA FILES | 2011-01-25 08:11:58.465866-06
Next I import the data using a very crude copy in from a file into my staging tables
these tables look like

bigdata_impression_stage (
hash16 varchar(16),
hash4 varchar(4),
incr_ident varchar(32),
bigstring varchar(52),
impression_time timestamp)
DISTRIBUTED BY (hash16

);

START IMPORT IMPRESSION STAGE DATA | 2011-01-25 08:11:58.466452-06
INSERT 0 12800000000
Time: 2734160.918 ms (~45 minutes, admittedly a very unoptimized import process though)
ANALYZE
Time: 121190.890 ms
FINISH IMPORT IMPRESSION STAGE DATA | 2011-01-25 08:59:33.818768-06
START IMPORT CLICK STAGE DATA | 2011-01-25 08:59:33.819516-06
INSERT 0 800012931
Time: 102449.285 ms
ANALYZE
Time: 1926.213 ms
FINISH IMPORT CLICK STAGE DATA | 2011-01-25 09:01:18.195508-06
Delete the data files
START CLEANUP DATA FILES | 2011-01-25 09:01:18.196396-06
result_text
————-
(0 rows)
Time: 88.866 ms
result_text
————-
(0 rows)
Time: 212.238 ms
FINISH CLEANUP DATA FILES | 2011-01-25 09:01:18.498019-06
Pull from my staging tables into tables of this form

impression (
cookie varchar(6),
publisher varchar(2),
site varchar(4),
impression_time timestamp
)
WITH (APPENDONLY=true, COMPRESSLEVEL=9, ORIENTATION=column, COMPRESSTYPE=zlib,
OIDS=FALSE
)
DISTRIBUTED BY (cookie)
PARTITION BY RANGE (impression_time) (
START ( date ’2010-01-01′ ) INCLUSIVE
END ( date ’2011-01-01′ ) EXCLUSIVE
EVERY ( INTERVAL ’1 DAY’)
);

Using the statement

INSERT INTO impression SELECT
substr(hash16,1,6),
substr(hash16,7,2),
hash4,
impression_time
FROM bigdata_impression_stage;
START INSERT INTO IMPRESSION FROM STAGE | 2011-01-25 09:01:18.498522-06
INSERT 0 12800000000
Time: 754964.529 ms (~12.5 minutes)
ANALYZE
Time: 217862.440 ms
FINISH INSERT INTO IMPRESSION FROM STAGE | 2011-01-25 09:17:31.335451-06
START INSERT INTO CLICK FROM STAGE | 2011-01-25 09:17:31.336244-06
INSERT 0 800012931
Time: 62466.581 ms
ANALYZE
Time: 133750.064 ms
FINISH INSERT INTO CLICK FROM STAGE | 2011-01-25 09:20:47.562561-06
FINISH RUN OF BIG DATA GEN | 2011-01-25 09:20:47.563334-06
Now we’ve got a couple of reasonable tables one of about 13 billion impressions and one of 800 million clicks.
Let’s do a couple queries and see what we come up with. Note that I haven’t created any indexes
RUN A COUPLE TEST QUERIES | 2011-01-25 09:20:47.563672-06
DO SELECT MIN(click_time), MAX(click_time) FROM click
min         |         max
---------------------+---------------------
2010-06-10 01:00:00 | 2010-12-21 04:24:28
(1 row)
Time: 1754.426 ms (2 seconds)
2011-01-25 09:20:49.31895-06
DO SELECT cookie, COUNT(1) FROM click GROUP BY cookie ORDER BY 2 DESC LIMIT 10
cookie | count
--------+-------
XXXXXX |    87
YYYYYY |    78
UUUUUU |    78
NNNNNN |    77
JJJJJJ |    75
RRRRRR |    75
WWWWWW |    74
MMMMMM |    73
IIIIII |    73
ZZZZZZ |    73
(10 rows)
Time: 7840.898 ms ( 8 seconds )
2011-01-25 09:20:57.160949-06
DO SELECT publisher, COUNT(1) FROM click GROUP BY publisher ORDER BY 2 DESC LIMIT 10
publisher |  count
-----------+---------
NN        | 2282895
PP        | 2281720
ZZ        | 2281617
TT        | 2281329
II        | 2280905
UU        | 2280861
MM        | 2280770
WW        | 2280389
OO        | 2280386
DD        | 2279800
(10 rows)
Time: 2265.925 ms  (2 seconds)
2011-01-25 09:20:59.427648-06
DO SELECT COUNT(1) FROM impression, click WHERE impression.cookie = click.cookie AND ( click.click_time – impression.impression_time) < INTERVAL ’2 seconds’
count
-------------
19623428623
(1 row)
Time: 197743.623 ms (3 minutes 17 seconds)
2011-01-25 09:24:17.172209-06
FINISHED RUN OF BIG DATA

The next steps will be to create  sql to populate the very basic profile data from the impressions and clicks. After that come up with a few nasty joins to throw at everything that’s been generated. Once I’m good with that I’ll work on packaging it all up into a more distributable format so people can compare benchmarks and see what can be done to spur talks of optimal platforms.

Tags: , ,

1 Comment

Alexandre Verri
May 30, 2012 at 7:53 pm

Hi scottk, it’s very interesting. Thank you for sharing this.


 

Reply

Copyright © 2014 SimpIT.com All rights reserved. Theme by Laptop Geek.