Re: [GSoC] Clustering in MADlib - status update

Lists: pgsql-hackers
From: Maxence Ahlouche <maxence(dot)ahlouche(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Andreas Scherbaum <ascherbaum(at)gopivotal(dot)com>, Caleb Welton <cwelton(at)gopivotal(dot)com>, Hai Qian <hqian(at)gopivotal(dot)com>, Sujit Philip <sphilip(at)gopivotal(dot)com>, Marc Pantel <Marc(dot)Pantel(at)enseeiht(dot)fr>, "devel(at)madlib(dot)net" <devel(at)madlib(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: [GSoC] Clustering in MADlib - status update
Date: 2014-05-25 17:17:54
Message-ID: CAJeaomUZfGXKyvUB4-6yxK5m+dVMLd+w+5DEm5MbYf2kErB0XA@mail.gmail.com

Hi,

Here is my first report. You can also find it on my Gitlab [0].

Week 1 - 2014/05/25

For this first week, I have written a test script that generates some
simple datasets, and produces an image containing the output of the MADlib
clustering algorithms.

This script can be called like this:

# generates a dataset called "ds0" with 8 clusters
./clustering_test.py new ds0 -n 8
# outputs the result of the clustering algorithms applied to ds0 in output.png
./clustering_test.py query ds0 -o output.png

See ./clustering_test.py -h for all the available options.

An example of output can be found here [1].

Of course, I will keep improving this test script, as it is still far from
perfect; but for now, it does approximately what I want.
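
To give an idea of what such a generated dataset could look like, here is a
minimal sketch. It is not the code of clustering_test.py: the function name,
its parameters and the choice of Gaussian blobs around random centers are all
assumptions made for illustration.

```python
import random


def make_dataset(n_clusters, points_per_cluster=50, spread=0.5,
                 extent=10.0, seed=None):
    """Return a list of (x, y, label) tuples grouped around random centers.

    Hypothetical helper: every name and default here is an assumption.
    """
    rng = random.Random(seed)
    points = []
    for label in range(n_clusters):
        # Pick a random center for this cluster inside the square [0, extent]^2.
        cx, cy = rng.uniform(0, extent), rng.uniform(0, extent)
        for _ in range(points_per_cluster):
            # Scatter the points around the center with Gaussian noise.
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread), label))
    return points


# Eight clusters, mirroring the "-n 8" option shown above.
dataset = make_dataset(n_clusters=8, seed=0)
```

Keeping the label next to each point makes it possible to compare the output
of a clustering algorithm with the clusters that were actually generated.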

For next week, I'll start working on the implementation of k-medoids in
MADlib. As a reminder, according to the timeline I suggested for the
project, this step must be done by May 30. Depending on the problems I will
face (mostly lack of knowledge of the codebase, I guess), this might not be
finished on time, but it should be done a few days later (by the end of
next week, hopefully).

Attached is the patch containing everything I have done this week, though
the git log might be more convenient to read.

Regards,

Maxence A.

[0] http://git.viod.eu/viod/gsoc_2014/blob/master/reports.rst
[1]
http://git.viod.eu/viod/gsoc_2014/blob/master/clustering_test/example_dataset.png

--
Maxence Ahlouche
06 06 66 97 00

Attachment Content-Type Size
week1.patch text/x-patch 25.5 KB

From: Maxence Ahlouche <maxence(dot)ahlouche(at)gmail(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Andreas Scherbaum <ascherbaum(at)gopivotal(dot)com>, Caleb Welton <cwelton(at)gopivotal(dot)com>, Hai Qian <hqian(at)gopivotal(dot)com>, Sujit Philip <sphilip(at)gopivotal(dot)com>, Marc Pantel <Marc(dot)Pantel(at)enseeiht(dot)fr>, "devel(at)madlib(dot)net" <devel(at)madlib(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GSoC] Clustering in MADlib - status update
Date: 2014-06-01 20:06:54
Message-ID: CAJeaomUhs0rfG13eYhn2UoTVWQ+2RZn2Y2-U9aVgPR+9afegUw@mail.gmail.com

Hi all!

I've pushed my report for this week on my repo [0]. Here is a copy!
Attached is the patch containing my work for this week.

Week 2 - 2014/06/01

This week, I have worked on the beginning of the kmedoids module.
Unfortunately, I was supposed to have something working for last Wednesday,
and it is still not ready, mostly because I've lost time this week by being
sick, and by packing all my stuff in preparation for relocation.

The good news now: this week is my last school (exam) week, and that means
full-time GSoC starting next Monday! Also, I've studied the kmeans module
quite thoroughly, and I can finally understand how it all works, with the
exception of one bit: the enormous SQL query used to update the
IterationController.

For kmedoids, I've abandoned the idea of writing the loop myself and have
decided instead to stick to copying kmeans as much as possible, as it seems
easier than doing it all from scratch. The only part that remains to be
adapted is that big SQL query I haven't totally understood yet. I've asked
Atri for help, but the help of an experienced MADlib hacker would surely
speed things up :) Atri and I would also like to deal with this through a
VoIP meeting, to ease communication. If anyone wants to join, you're
welcome!

As for the technology we'll use, I have a Mumble server running somewhere,
if that suits everyone. Otherwise, suggest something!

I am available from Monday 2 at 3 p.m. (UTC) to Wednesday 4 at 10 a.m.
(exam weeks are quite light).

This week, I have also faced the first design decisions I have to make. For
kmedoids, the centroids are points of the dataset. So, if I wanted to
identify them precisely, I'd need to use their ids, but that would mean
having a prototype different from the kmeans one. So, for now, I've decided
to use the points' coordinates only, hoping I will not run into trouble. If
I ever do, switching to ids shouldn't be too hard. Also, if users want
to input initial medoids, they can input whatever points they want, be they
part of the dataset or not. After the first iteration, the centroids will
be points of the dataset anyway (maybe I could just select the points
nearest to the coordinates they input as initial centroids).
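
The idea in the parenthesis above, replacing each user-supplied centroid by
the nearest point of the dataset, could be sketched as follows. This is plain
Python rather than MADlib code; the function name and the use of Euclidean
distance are assumptions made for illustration.

```python
import math


def snap_to_dataset(initial_centroids, points):
    """Replace each requested centroid by the closest point of the dataset.

    Hypothetical helper; Euclidean distance is an assumption.
    """
    return [min(points, key=lambda p: math.dist(p, c))
            for c in initial_centroids]


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
# The requested centroids are not part of the dataset.
medoids = snap_to_dataset([(0.2, 0.1), (5.4, 4.9)], points)
# medoids is now [(0.0, 0.0), (5.0, 5.0)]
```

Note that two requested centroids may snap to the same point, so a real
implementation would probably have to detect and handle duplicates.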

Second, I'll need to refactor the code in kmeans and kmedoids, as these two
modules are very similar. There are several options for this:

1. One big "clustering" module containing everything clustering-related
(ugly but easy option);
2. A "clustering" module and "kmeans", "kmedoids", "optics", "utils"
submodules (the best imo, but I'm not sure it's doable);
3. A "clustering_utils" module at the same level as the others (less
ugly than the first one, but easy too).

Any opinions?

Next week, I'll get a working kmedoids module, do some refactoring, and
then add the extra methods, similar to what's done in kmeans, for the
different seedings. Once that's done, I'll make it compatible with all
three ports (I'm currently producing Postgres-only code, as it's the
easiest for me to test), and write the tests and doc. The deadline for this
last step is in two weeks; I don't know yet whether I'll be on time or
not. It will depend on how fast I can get kmedoids working, and how fast
I'll go once I'm working on GSoC full time.

Finally, don't hesitate to tell me if you think my decisions are wrong; I'm
glad to learn :)

[0] http://git.viod.eu/viod/gsoc_2014/blob/master/reports.rst

--
Maxence Ahlouche
06 06 66 97 00

Attachment Content-Type Size
week2.patch text/x-patch 9.1 KB

From: Hai Qian <hqian(at)gopivotal(dot)com>
To: Maxence Ahlouche <maxence(dot)ahlouche(at)gmail(dot)com>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Andreas Scherbaum <ascherbaum(at)gopivotal(dot)com>, Caleb Welton <cwelton(at)gopivotal(dot)com>, Sujit Philip <sphilip(at)gopivotal(dot)com>, Marc Pantel <Marc(dot)Pantel(at)enseeiht(dot)fr>, "devel(at)madlib(dot)net" <devel(at)madlib(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GSoC] Clustering in MADlib - status update
Date: 2014-06-02 17:16:30
Message-ID: CACGxcfQ_KurZ1gtaa8HzqKkdQw52Qi2RtBQR01_P=HuMF-byXw@mail.gmail.com

I like the second option for refactoring the code. I think it is doable.

And where is your code on Github?

Hai

--
*Pivotal <http://www.gopivotal.com/>*
A new platform for a new era



From: Maxence Ahlouche <maxence(dot)ahlouche(at)gmail(dot)com>
To: Hai Qian <hqian(at)gopivotal(dot)com>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Andreas Scherbaum <ascherbaum(at)gopivotal(dot)com>, Caleb Welton <cwelton(at)gopivotal(dot)com>, Sujit Philip <sphilip(at)gopivotal(dot)com>, Marc Pantel <Marc(dot)Pantel(at)enseeiht(dot)fr>, "devel(at)madlib(dot)net" <devel(at)madlib(dot)net>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [GSoC] Clustering in MADlib - status update
Date: 2014-06-02 18:09:18
Message-ID: CAJeaomXamLK84CBZd+jiT75eN1wo2PgLiWhbBUHfRAZ67_wwPQ@mail.gmail.com

Hi!

2014-06-02 19:16 GMT+02:00 Hai Qian <hqian(at)gopivotal(dot)com>:

> I like the second option for refactoring the code. I think it is doable.
>
> And where is your code on Github?
>

It's not on Github, but on my own Gitlab (a self-hosted open-source
alternative to github). You can find it here [0]. I'm using two repos: one
is a clone of madlib, the other contains my reports, my test script and
other stuff.

[0] http://git.viod.eu/public/

--
Maxence Ahlouche
06 06 66 97 00


From: Maxence Ahlouche <maxence(dot)ahlouche(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "devel(at)madlib(dot)net" <devel(at)madlib(dot)net>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Andreas Scherbaum <ascherbaum(at)gopivotal(dot)com>, Caleb Welton <cwelton(at)gopivotal(dot)com>, Sujit Philip <sphilip(at)gopivotal(dot)com>, Marc Pantel <Marc(dot)Pantel(at)enseeiht(dot)fr>, Hai Qian <hqian(at)gopivotal(dot)com>
Subject: Re: [GSoC] Clustering in MADlib - status update
Date: 2014-06-15 22:01:15
Message-ID: CAJeaomUnt0LBQGW_xKa904FZPq3FJ=TXo7T4wXUD+fZ+EYB-=A@mail.gmail.com

Hi! Here is my report for the last two weeks.

Weeks 3 and 4 - 2014/06/15

During my third week, I didn't have much time to work on GSoC, because of
my exams and my relocation (that's why I didn't deem it necessary to post a
report last Sunday). But last week has been much more productive, as I am
now working full time!

I have developed an aggregate that computes the sum of pairwise
dissimilarities in a cluster, for a given medoid. Thanks to Hai and Atri, I
have also developed the main SQL function that actually computes the
k-medoids. This function is still being debugged, so I have not committed
it yet.
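
For readers unfamiliar with k-medoids, the two pieces mentioned above (the
cost of a cluster for a given medoid, and the main loop) can be sketched in
plain Python. This is not the SQL aggregate or the MADlib function itself:
the function names, the Euclidean distance and the simple alternating
assign/update scheme are assumptions made for illustration.

```python
import math


def cluster_cost(medoid, cluster):
    """Sum of dissimilarities between a candidate medoid and its cluster."""
    return sum(math.dist(medoid, p) for p in cluster)


def k_medoids(points, medoids, max_iter=20):
    """Alternate assignment and medoid update until the medoids are stable."""
    medoids = list(medoids)
    for _ in range(max_iter):
        # Assignment step: attach each point to its closest medoid.
        clusters = [[] for _ in medoids]
        for p in points:
            nearest = min(range(len(medoids)),
                          key=lambda i: math.dist(p, medoids[i]))
            clusters[nearest].append(p)
        # Update step: in each cluster, keep the member with the lowest cost
        # (an empty cluster keeps its previous medoid).
        new_medoids = [
            min(cluster, key=lambda c: cluster_cost(c, cluster)) if cluster else m
            for cluster, m in zip(clusters, medoids)
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = k_medoids(points, [(0, 1), (11, 10)])
# result is [(0, 0), (10, 10)]
```

Unlike k-means, every medoid returned is an actual point of the dataset,
which is why the update step searches among the cluster members instead of
averaging them.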

According to my plan, I am behind schedule: I should have finished working
on k-medoids on Friday. When I made this timeline, I largely underestimated
the time needed to get started on this project, and overestimated the time
I could spend on GSoC during my exams. But things will now go
much faster!

As for our weekly phone call, I have a lot of difficulty understanding
what is said, partly because I'm not used to hearing English, but
mostly because of the low sound quality. Last time, I understood barely half
of what was said, which is quite unfortunate, given that I'm supposed to
get advice during this phone call. So I'd like to suggest an alternative:
an IRC channel, for example. And for those who don't have an IRC client
ready: http://webchat.freenode.net/ . For example, the channel #gsoc-madlib
would surely be appropriate :) Also, I've had a change in my timetable,
which makes Tuesday inconvenient for this phone call. Is it possible to
change the day? I'm available at this hour on Monday, Wednesday and
Thursday. Of course, if this change bothers too many people, I'll deal with
Tuesday :)

Finally, for the coming week, I'll finish debugging k-medoids, write all
the secondary functions (e.g. random initial medoids), and write the doc.

Regards,

Maxence A.

--
Maxence Ahlouche
06 06 66 97 00


From: Maxence Ahlouche <maxence(dot)ahlouche(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, "devel(at)madlib(dot)net" <devel(at)madlib(dot)net>
Cc: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Andreas Scherbaum <ascherbaum(at)gopivotal(dot)com>, Caleb Welton <cwelton(at)gopivotal(dot)com>, Sujit Philip <sphilip(at)gopivotal(dot)com>, Marc Pantel <Marc(dot)Pantel(at)enseeiht(dot)fr>, Hai Qian <hqian(at)gopivotal(dot)com>
Subject: Re: [GSoC] Clustering in MADlib - status update
Date: 2014-06-22 22:16:30
Message-ID: CAJeaomUVSOXxx-hWR5fsL5GS66ntSuLGixnJe5UO7OW7rgMOrg@mail.gmail.com

Hi!

Here's my report for week 5.

Week 5 - 2014/06/22

This week has been full of debugging of the main SQL function. The previous
week, I had been able to come up with a working function to compute a
medoid for a given group of points, but since then I've struggled to
integrate it with the rest of the SQL. Some errors were trivial (for
example some parameters that I had written with underscores instead of
using camelCase - Hai spotted this one, I think I'd never have found it by
myself), others less so. But it's coming!

According to the timeline I had planned at the beginning of the project,
I'm definitely late. The module I'm still writing should have been finished
last week, and it's not even working yet. It seems I was far too
optimistic in this timeline. For the second step, as I'll have less time than
expected, I'm thinking of switching from OPTICS to DBSCAN, which at least I
have fully understood (OPTICS is quite complicated). Is everyone ok with
this?

Next week is the evaluation week. Hopefully I'll be allowed to continue
working on this project, even though I haven't produced many results so
far :p As for me, I have nothing to complain about: I've always been met
with patience and clear answers to my questions. Only the phone calls didn't
turn out as well as they sounded, but this problem will be fixed at our next
meeting, as we'll now use IRC!