Tsoobame

software, japanese, kyudo and more

Understanding Dataloader

Dataloader is one of the most useful and clever packages in my toolbox.

I am going to set up an obviously naive example and walk through the process of building a simple dataloader, to understand its elegance and how useful it is.

About the project

We are going to create a view and an API over a social network. Our user relations are:

User 1 friend of [ 2, 3 ]
User 2 friend of [ 1, 3 ]
User 3 friend of [ 1, 2, 4 ]
User 4 friend of [ 3, 5 ]
User 5 friend of [ 4 ]

The view shows the relation between users and their friends, and we can render N levels of friendship. We are not going to look much at the view in this post.

The user data can be found in the users.json file within the project.

The only dependency will be express.

Initial Setup

datasource.js

The datasource allows us to retrieve one or multiple users by id. The contract is not arbitrary: it is modeled on the real dataloader, so there will be minimal changes over the course of the post. The data is defined in a file within the project. The code is pretty simple:

const users = require('./users.json')


const getUsersFromFile = (ids) => ids.map(id => users.find(u => u.id === id))
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

async function loadMany(ids) {
    console.log(`GET /users?ids=${ids}`)

    await sleep(100)
    return getUsersFromFile(ids)
}

async function load(id) {
    const results = await loadMany([id])
    return results[0]
}

module.exports = {
    load,
    loadMany
}

The only interesting method is loadMany. Each request to the simulated service is logged, so we can follow along in the console, and the promise resolves after an artificial delay to better simulate a remote call and make the benefits of dataloader easier to appreciate.

A very important requirement is that data must be returned to the caller in the right order and that every requested id has an entry in the result (the ids and results arrays have the same length). The reason will become clear once we put the dataloader in place.
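To make the contract concrete, here is a tiny standalone sketch (toy data, not the project's users.json): the results line up with the requested ids, one entry per id, even when an id cannot be found.

const toyUsers = [{ id: 1, name: 'A' }, { id: 2, name: 'B' }]

// Same length and same order as the ids; unknown ids still occupy their slot.
async function loadMany(ids) {
    return ids.map(id => toyUsers.find(u => u.id === id))
}

loadMany([2, 1, 99]).then(results => {
    console.log(results)
    // [ { id: 2, name: 'B' }, { id: 1, name: 'A' }, undefined ]
})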

resolver.js

The resolver uses the datasource it receives as a parameter to load friendship data about users. It accepts the number of levels of friends we want to fetch, so it uses a recursive approach to load friends of friends until all levels are fetched.

async function getFriends(datasource, user, levels) {
    if (levels == 0) {
        return { id: user.id, name: user.name }
    }

    const friends = await datasource.loadMany(user.friends)

    return {
        ...user,
        friends: await Promise.all(
             friends.map(f => getFriends(datasource, f, levels - 1))
        )
    }
}


async function getUserWithFriends(datasource, id, levels = 1) {
    const user = await datasource.load(id)
    return getFriends(datasource, user, levels)
}

module.exports = { getUserWithFriends }

It uses a brute-force approach on purpose. The code is simple but far from optimal. In a single method it looks obvious, but when we are building GraphQL or similar APIs, or complex workflows, we might be making exactly this kind of brute-force request without noticing.

view.js

Nothing advanced. It just renders a user's friends in a nested way.

function render(user) {
    return `<div style="padding-left: 12px;background-color:#def">
            ${user.name}
            ${user.friends ? user.friends.map(u => render(u)).join('') : ""}
    </div>`
}


module.exports = {
    render
}

server.js


const express = require('express')
const PORT = 3000
const app = express()

const datasource = require('./datasource')
const resolver = require('./resolver')
const view = require('./view')


app.get(`/user-with-friends/:id`, async (req, res) => {
    const id = req.params.id
    const levels = req.query.levels || 1

    const user = await resolver.getUserWithFriends(datasource, id, levels)

    res.send(view.render(user))

})

app.listen(PORT, () => console.log(`Fakebook listening to ${PORT}`))


Run

node server.js

Test 1

We will render friends of user 1. Only 1 level:

http://localhost:3000/user-with-friends/1

If we check in our console we will find:

GET /users?ids=1
GET /users?ids=2,3

All good. We requested user 1 and their friends 2 and 3.

Test 2

Let's try by loading 3 levels:

http://localhost:3000/user-with-friends/1?levels=3

Things are getting interesting here:

GET /users?ids=1
GET /users?ids=2,3
GET /users?ids=1,3
GET /users?ids=1,2,4
GET /users?ids=2,3
GET /users?ids=1,2,4
GET /users?ids=2,3
GET /users?ids=1,3
GET /users?ids=3,5

We are loading data for users 1, 2, 3, 4 and 5, but we are making 9 requests, fetching the same users again and again. We could easily improve the situation by adding some sort of per-request cache.

Cache per request

We are going to add a cache to the system. It will be empty at the start of each request, so we do not need to worry about expiration. The benefits will be:

  • We do not request the same resource twice from the remote source during the same request.
  • As a side effect, if we ask for the same resource twice during the same request, we get the same data, so a mutation of the resource in the middle of the request cannot produce inconsistent results.

cache.js

Simple cache implementation:


function make(loadManyFn) {

    const cache = {}

    async function loadMany(ids) {
        const notCachedIds = ids.filter(id => !cache[id])

        if (notCachedIds.length > 0) {
            const results = await loadManyFn(notCachedIds)
            notCachedIds.forEach((id, idx) => cache[id] = results[idx])
        }

        return ids.map(id => cache[id])
    }

    return {
        load: async id => {
            const results = await loadMany([id])
            return results[0]
        },
        loadMany
    }

}

module.exports = { make }

The cache needs a function that retrieves multiple items by id (or, in general, by key). It checks which items are already cached and requests only the ids that are not found.

It implements the same contract as the datasource.

server.js

Let's add this line to the server:

const cache = require('./cache')

And replace this line:

const user = await resolver.getUserWithFriends(datasource, id, levels)

with:

const user = await resolver.getUserWithFriends(cache.make(datasource.loadMany), id, levels)

Run

Let's run the server again and repeat the previous request:

http://localhost:3000/user-with-friends/1?levels=3

GET /users?ids=1
GET /users?ids=2,3
GET /users?ids=4
GET /users?ids=4
GET /users?ids=5

We reduced the number of requests from 9 to 5, which is pretty good. But wait a moment, what happened here? Why are we requesting id=4 twice?

If we unnest the request flow, based on how Node.js works (and on how we implemented our resolver), this is what happened:

  • 1 - Load user 1 => GET /users?ids=1
    • 2 - Load friends of 1: [2,3]=> GET /users?ids=2,3
      • 3.1. Load friends of 2: [1,3] => all cached
        • 4.1. Load friends of 1 : [2,3] => all cached
        • 4.2. Load friends of 3 : [1,2,4] => GET /users?ids=4
      • 3.2. Load friends of 3: [1,2,4] => GET /users?ids=4
        • 4.3. Load friends of 1: [2,3] => all cached
        • 4.4. Load friends of 2: [1,3] => all cached
        • 4.5. Load friends of 4: [3,5] => GET /users?ids=5

At 3.1 we had all friends of user 2 cached, so the code went straight to 4.2, which ran in parallel with 3.2. Both were waiting for the same user (4) and therefore made the same request twice.

So with our simple cache, we did not reduce the requests to the minimum we wanted.

For example, if we did:

const users = await Promise.all([load(1), load(1)])

there would be 2 requests, because both calls start before the cache has any data for id=1.

Let's fix this and produce the ideal sequence of requests:

GET /users?ids=1
GET /users?ids=2,3
GET /users?ids=4
GET /users?ids=5

Dataloader

Using Node.js process.nextTick(...) we can postpone the execution of a given function until the current operation completes, before the event loop is allowed to continue. It is useful, for example, to run a function after all variables have been initialized.

From the Node.js documentation:

By using process.nextTick() we guarantee that apiCall() always runs its callback after the rest of the user's code and before the event loop is allowed to proceed.

Using it, we can accumulate all the keys that are requested during the same cycle (3.2 and 4.2 in the example above) and request them together at the end. In the next cycle we accumulate the ones that depended on the previous results, and so on.
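Here is a minimal standalone sketch of that idea (not the final implementation): every id requested during the same cycle ends up in a single batch, because the process.nextTick callback only runs after all the synchronous code has finished.

const pending = []

function request(id) {
    // Schedule the batch only once per cycle, on the first requested id.
    if (pending.length === 0) {
        process.nextTick(() => console.log(`batched GET /users?ids=${pending}`))
    }
    pending.push(id)
}

request(1)
request(2)
request(3)
console.log('all requests queued')

// Output:
// all requests queued
// batched GET /users?ids=1,2,3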

This simplified version of dataloader also incorporates the caching behaviour:


function make(loadManyFn) {

    const cache = {}
    let pending = []
    let scheduled = false
    function scheduleSearch() {
        if (pending.length > 0 && !scheduled) {
            scheduled = true
            Promise.resolve().then(() => process.nextTick(async () => {
                await runSearch()
                scheduled = false
            }))
        }
    }

    async function runSearch() {
        const pendingCopy = pending.splice(0, pending.length)
        pending = []

        if (pendingCopy.length > 0) {
            const results = await loadManyFn(pendingCopy.map(p => p.id))
            pendingCopy.forEach(({ resolve }, idx) => resolve((results[idx])))
        }

    }


    async function loadMany(ids) {
        const notCachedIds = ids.filter(id => !cache[id])

        if (notCachedIds.length > 0) {
            notCachedIds.map(id => {
                cache[id] = new Promise(resolve => {
                    pending.push({ id, resolve })
                })
            })

            scheduleSearch()
        }

        return Promise.all(ids.map(id => cache[id]))
    }

    return {
        load: async id => {
            const results = await loadMany([id])
            return results[0]
        },
        loadMany
    }

}


module.exports = { make }

Ignoring the caching part, the important bits are:

Accumulating requests

notCachedIds.map(id => {
    cache[id] = new Promise(resolve => {
        pending.push({ id, resolve })
    })
})

We add the ids that are not cached to the list of pending ids, keeping both the id and the resolve function so we can later resolve each promise with the right value. Note that we cache the promise itself in the map. This would also allow us to cache rejected promises, for example, so we do not request the same failing id over and over. It is not used in this implementation, though.

Scheduling the request

 function scheduleSearch() {
        if (pending.length > 0 && !scheduled) {
            scheduled = true
            Promise.resolve().then(() => process.nextTick(async () => {
                await runSearch()
                scheduled = false
            }))
        }
    }

This is where the magic happens. The function is short but it is the most important one: we schedule/delay the actual request until after all the promise declarations of the current cycle have run.

    async function runSearch() {
        const pendingCopy = pending.splice(0, pending.length)
        pending = []

        if (pendingCopy.length > 0) {
            const results = await loadManyFn(pendingCopy.map(p => p.id))
            pendingCopy.forEach(({ resolve }, idx) => resolve((results[idx])))
        }

    }

We take the pending entries out of the list (so new ids can be accumulated while the search completes) and call loadManyFn, resolving the promises we had pending. Remember the requirement that loadMany returns every element, in the same order as the ids? This is where it is needed: we can match each result to its pending promise by index.

Let's run it!

Execution

Again the same request:

http://localhost:3000/user-with-friends/1?levels=3

That produces the following output:

GET /users?ids=1
GET /users?ids=2,3
GET /users?ids=4
GET /users?ids=5

Exactly what we wanted.

Conclusion

- Dataloader is a great package that should be in every developer's toolbox, especially for those implementing GraphQL or similar APIs.

- The resolvers in this example could be optimized by hand, but in real projects our requests often live in different files, at different levels, and depend on runtime conditions. With Dataloader we can keep our file structure and code readability without hurting performance, both in response time to the client and in the number of requests spawned within our mesh.


Are you using Dataloader? Do you know any tool that accomplishes something similar? Do you know any other packages that, in your opinion, should be in every Node.js developer's toolbox?

Graphql Stitching - Part 1

I am going to write a short (?) post about how to create a simple API Gateway that exposes two services using GraphQL stitching. I am assuming some knowledge of GraphQL and Apollo Server.
We will use express, nodejs and apollo for the services, and a technique called schema stitching.
If you want to learn more about GraphQL you can go to the official site.

Why do we need API gateways and schema stitching

I will write a whole post about the reasons we had to use GraphQL in our services and in our API Gateway.
Here is a short explanation:
In real-world scenarios we are creating independent and autonomous (micro)services. The less data they share, the less they need to call each other and the less coupled they are, the better.
Often a service manages entities (or parts of entities) that hold the id of another entity but does not need to know more details. For example, an inventory service might manage a productID and the available units, but does not need to know the name of the product or its price.
The inventory service can run all its operations and apply the rules it owns without requesting information from any other service.
Users, on the other hand, need to see this scattered data together on one screen. To avoid too many requests from the UI, an API Gateway can offer a single endpoint where the UI requests all the data needed for a specific functionality/screen in one call, and the Gateway orchestrates the calls to the other services, caches results if needed, etc.

Let's start working

Let's create a folder as the root for our project:
mkdir graphql-stitching
cd graphql-stitching


Creating the songs service

We are going to create a simple service that offers data about songs.

mkdir songs
cd songs
npm init -y
npm install express cors graphql graphql-tag graphql-tools apollo-server-express body-parser
We are going to create our schema first:
touch schema.js
schema.js
const { makeExecutableSchema } = require("graphql-tools");
const gql = require('graphql-tag')

const songs = [
    { id: 1, title: "I will always love you" },
    { id: 2, title: "Lose yourself" },
    { id: 3, title: "Eye of the tiger" },
    { id: 4, title: "Men in Black" },
    { id: 5, title: "The power of love" },
    { id: 6, title: "My Heart will go on" }
];

const typeDefs = gql`
    type Query {
        songs: [Song]
        song(songId: ID!): Song
    }
    type Song {
        id: ID
        title: String
    }
`;

const resolvers = {
    Query: {
        songs: () => {
            return songs;
        },
        song(parent, args, context, info) {
            return songs.find(song => song.id === Number(args.songId));
        }
    }
};

module.exports = makeExecutableSchema({
    typeDefs,
    resolvers
});


We are defining a list of songs, the Song type (id, title), and two queries: one to get all songs and one to get a song by id.

Let's create the api:
touch index.js
index.js:
const express = require('express')
const { ApolloServer } = require('apollo-server-express')
const cors = require('cors')
const schema = require('./schema')
const bodyParser = require('body-parser')

const app = express()
app.use(cors())
app.use(bodyParser.json())

const server = new ApolloServer({
    playground: {
        endpoint: '/api',
        settings: {
            'editor.cursorShape': 'block',
            'editor.cursorColor': '#000',
            'editor.theme': 'light'
        }
    },
    schema
})

server.applyMiddleware({ app, path: '/api' })

app.listen(3000, () => {
    console.log('Song services listening to 3000...')
})


We create a simple Express service using Apollo Server to expose both the API and the playground where we can test our queries.
node index.js
and open the songs API at http://localhost:3000/api.
You will see the playground, so you can run the first query:
{
  songs{
    id 
    title
  }
}
You should be able to see the results.

Creating the movies service

We are going to follow the same process. From the root of our project:
mkdir movies
cd movies
touch index.js
touch schema.js
npm init -y
npm install express cors graphql graphql-tag graphql-tools apollo-server-express body-parser
index.js will be similar to the previous one; only the port number needs to change:
const express = require('express')
const { ApolloServer } = require('apollo-server-express')
const cors = require('cors')
const schema = require('./schema')
const bodyParser = require('body-parser')

const app = express()
app.use(cors())
app.use(bodyParser.json())

const server = new ApolloServer({
    playground: {
        endpoint: '/api',
        settings: {
            'editor.cursorShape': 'block',
            'editor.cursorColor': '#000',
            'editor.theme': 'light'
        }
    },
    schema
})

server.applyMiddleware({ app, path: '/api' })

app.listen(3001, () => {
    console.log('Movie services listening to 3001...')
})


Schema will be very similar:
const { makeExecutableSchema } = require("graphql-tools");
const gql = require('graphql-tag')

const movies = [
    { id: 1, title: "The Bodyguard", mainSongId: 1 },
    { id: 2, title: "8 Mile", mainSongId: 2 },
    { id: 3, title: "Rocky III", mainSongId: 3 },
    { id: 4, title: "Men in Black", mainSongId: 4 },
    { id: 5, title: "Back to the Future", mainSongId: 5 },
    { id: 6, title: "Titanic", mainSongId: 6 }
];

const typeDefs = gql`
    type Query {
        movies: [Movie]
        movie(movieId: ID!): Movie
    }
    type Movie {
        id: ID!
        title: String!
        mainSongId: ID!
    }
`;

const resolvers = {
    Query: {
        movies: () => {
            return movies;
        },
        movie(parent, args, context, info) {
            return movies.find(movie => movie.id === Number(args.movieId));
        }
    }
};

module.exports = makeExecutableSchema({
    typeDefs,
    resolvers
});


The difference is that a movie holds a reference to a song, specifically mainSongId. Since both services are isolated and autonomous, the movies service does not know where the songs service is or what data a song holds. It only knows that a movie has a main song and stores its ID.

If we run the project in the same way
node index.js
we can see the playground (http://localhost:3001/api) and run our test queries.
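For instance, the movie-by-id query (using the sample data defined above) should return a single movie together with the id of its main song:

{
  movie(movieId: 3) {
    title
    mainSongId
  }
}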

Let's start the interesting part: our API Gateway

We are going to create the same files. From project root:
mkdir apigateway
cd apigateway
touch index.js
touch schema.js
npm init -y
npm install express cors graphql graphql-tag graphql-tools apollo-server-express body-parser apollo-link-http node-fetch

The schema will be created from the schemas of the other services; we are going to stitch them together and expose them in the API gateway.
schema.js
const {
    introspectSchema,
    makeRemoteExecutableSchema,
    mergeSchemas,
} = require("graphql-tools")
const { createHttpLink } = require("apollo-link-http");
const fetch = require("node-fetch");


const MoviesUrl = 'http://localhost:3001/api'
const SongsUrl = 'http://localhost:3000/api'


async function createServiceSchema(url) {
    const link = createHttpLink({
        uri: url,
        fetch
    });
    const schema = await introspectSchema(link);
    return makeRemoteExecutableSchema({
        schema,
        link
    });
}

async function createSchemas() {
    const movieSchema = await createServiceSchema(MoviesUrl);
    const songsSchema = await createServiceSchema(SongsUrl);

    return mergeSchemas({ schemas: [songsSchema, movieSchema] })
}

module.exports = createSchemas()


 
As you can see in the code, the schema is generated by requesting the schemas of both APIs and merging them.
One difference is that now we need to fetch this data before we can start the apigateway, so index.js will be slightly different:
const express = require('express')
const { ApolloServer } = require('apollo-server-express')
const cors = require('cors')
const createSchema = require('./schema')
const bodyParser = require('body-parser')

const app = express()
app.use(cors())
app.use(bodyParser.json())

createSchema.then(schema => {
    const server = new ApolloServer({
        playground: {
            endpoint: '/api',
            settings: {
                'editor.cursorShape': 'block',
                'editor.cursorColor': '#000',
                'editor.theme': 'light'
            }
        },
        schema
    })
    server.applyMiddleware({ app, path: '/api' })
    app.listen(4000, () => {
        console.log('Graphql listening to 4000...')
    })
})

Before starting the listener, the schema is requested and merged so we can expose it in our api.

We need to run the previous services in order to be able to execute this one. From the root of the project:
node movies/index.js &
node songs/index.js &
node apigateway/index.js
If we go to the API gateway playground (http://localhost:4000/api) we can query movies and songs in the same request:
{
  movies{
    id
    title
    mainSongId
  }
  
  songs {
    id
    title
  }
}

This was an introduction to schema stitching. In part 2 I will show some more concepts and real-world scenarios, like extending the services' schemas in the API gateway with custom resolvers and how to optimize the gateway by using dataloaders.

If you have any questions about GraphQL schema stitching or about API gateways in general, please add a comment or contact me.


How to become a good developer

Some days ago, someone in my company asked me a fair-but-not-easy-to-answer question: 

How do you think I can become a good software developer?

Without much time to prepare for this question, my first answer was about trying to build a real project. You can read and follow a lot of tutorials (of course they are important), but you will not really learn until you put them into practice in a real project.

If your company does not use a given technology you want to learn, just start your own project using it. It does not need to be successful in terms of business, but it needs to be real.

After this quick answer, I started to think about when I began to feel I was becoming a mature software developer. This is the outcome of those thoughts:


Understand that the code you write is for others to read

This is something very basic, but sometimes we forget it. The code we write today will be read and maintained by others, or by ourselves in the future. Also, the effort a company (or a group of developers) puts into developing an application is very small compared to the effort it will invest in maintaining it (solving issues, adding features, improving some areas, etc.).

Some basic rules for accomplishing this are:

  • Do not rely on comments to explain the code. Code needs to be self-explanatory, so that by reading it everyone understands WHAT is being done. Comments should be used only when the WHY of what we are doing is not clear, or to explain lines of code that might look like potential bugs to others.
  • We all like to write perfect algorithms and super-optimal code, but it usually makes the code harder to read. Do not put too much magic in your code, and if you need it, enclose it in methods with clear names (see the small sketch after this list).
  • Perform code reviews frequently (on every push, for example). The reviewer should be able to understand the code without many explanations from the developer who wrote it, so make sure that variable and method names are clear.
  • Use an external tool to validate the code style. There are plenty of them, and they ensure that the code is uniform throughout the project.
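As an illustration of enclosing magic in well-named methods, here is a small hypothetical sketch: the bit trick is hard to read inline, but obvious once it gets its own clearly named function.

// Hard to read if inlined at the call site; clear once it has a name.
function isPowerOfTwo(n) {
    return n > 0 && (n & (n - 1)) === 0
}

const sizes = [1, 3, 8, 12, 64]
console.log(sizes.filter(isPowerOfTwo))   // [ 1, 8, 64 ]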

Decide clear naming for methods and classes

This is related to the previous point, but it deserves special attention. When you think about how to name a method or a class, you are deciding its responsibility and its boundaries, which helps you design a modular system.

The name of a class, method, etc. needs to make clear to everyone what it holds. For this to be true, we need to ensure that the contents of the class, or the behaviour of the method with its inputs and outputs, match the name we choose. If they do not, or if they change over time, do not be afraid of renaming to keep the name accurate.

Examples of bad names include the typical *Helper. The name states whom it helps, but it does not help us understand what the class holds. For example, StringHelper can hold validation methods, formatting methods, constants, localisation, etc.
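A small hypothetical sketch of the difference: the helper name says nothing about responsibilities, while names chosen around a single responsibility make the boundaries obvious.

// The catch-all: anything string-related ends up here over time.
const StringHelper = {
    isValidEmail: value => /\S+@\S+\.\S+/.test(value),
    capitalize: value => value.charAt(0).toUpperCase() + value.slice(1)
}

// Clear responsibilities and boundaries:
const EmailValidator = {
    isValid: value => /\S+@\S+\.\S+/.test(value)
}
const NameFormatter = {
    capitalize: value => value.charAt(0).toUpperCase() + value.slice(1)
}

console.log(StringHelper.isValidEmail('ada@lovelace.dev'))   // true
console.log(EmailValidator.isValid('ada@lovelace.dev'))      // true
console.log(NameFormatter.capitalize('ada'))                 // Ada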

Learn about Unit Testing and keep it in mind when developing

When unit testing classes and methods, you need to think about the boundaries of those classes and the responsibilities of each method.

This includes what output a method needs to deliver to the caller and how it should react to a given input, and it will raise lots of "what if" questions that make the code much more robust.

It will also help you think about which responsibilities the class should have and which should be delegated to a different one, building the dependency map in an evolutionary way and making dependency injection feel natural.
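A minimal sketch (hypothetical function and tests) of how writing the test surfaces those "what if" questions:

const assert = require('assert')

// The unit under test: the name and the tests define its boundaries.
function truncate(text, max) {
    if (typeof text !== 'string') throw new TypeError('text must be a string')
    return text.length > max ? `${text.slice(0, max)}...` : text
}

assert.strictEqual(truncate('hello world', 5), 'hello...')
assert.strictEqual(truncate('hi', 5), 'hi')            // what if it is already short enough?
assert.throws(() => truncate(null, 5), TypeError)      // what if the input is not a string?
console.log('all assertions passed')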

Conclusions

As you can see, my main concerns about good developers are not about technical skills but about clarity in the code and well-structured projects.

If all this is accomplished, a team can maintain the code easily; they can look for better algorithms in the complex calculation areas of the code, or look for third-party libraries that can fulfil some of the dependencies they encounter.

Of course we can become experts in some technologies, tactics or infrastructures, but first we need to change our mindset and work for others: not only for our customers, but also for other developers (providing easy-to-read code), for other projects in our company (offering ways of reusing our code through libraries), or even for third-party companies (by offering APIs).


What would you recommend so that junior developers can become good software engineers? 



Spring Cloud - Introduction

I am starting to learn about the Spring Cloud framework and the Netflix OSS. I will not add much information since I am a newbie here, but I just wanted to share a video (well, actually a list of videos) that shows its capabilities, i.e. how easy it is to build distributed applications using the framework.

Of course this ease has a price, since you get quite tightly bound to the framework. But if you want to be highly productive and you care about the business, and not so much about controlling the infrastructure at a low level, this could be a perfect way to go.

You can see the first video with the introduction by Matt Stine here:


You can see the whole playlist (7 videos) by clicking here.


Event sourcing: introduction by Greg Young

Some days ago I watched a video about event sourcing and CQRS where Greg Young introduces the concepts in a very nice way.

I remember a project where we had to record every action every user performed in the system, so we had the usual database with the current state of the system plus a historical table with all the actions performed.

Event sourcing starts from the assumption that all the information a system holds is derived from the events that produced changes in that system, so those events are worth storing. Instead of focusing on the current state of the data, it focuses on the process that led to the current state. For regulated environments, storing all these events is a must, so they fall naturally into the event sourcing pattern, but many other businesses can benefit a lot from using event sourcing.

This way of working is not new. Banks, lawyers, human resources, etc. have worked this way since long before information was managed by something called software. The current state of your account is just the result of applying all the events/actions that affected it. No one would want to know their savings from a table holding only the current state, without all the movements on the account.

It has a lot of advantages, like being able to go back in time and see how the system was at a given point, to calculate the evolution of the system, and to analyze the data in a deeper way than just looking at the current state. For business intelligence reports and dashboards it is very useful.

It also has some ugly disadvantages. Since you focus on storing the changes made to your system, you need to replay all the events in order to calculate its current state. That is not performant enough if you need to show this information to your users in a reasonable time. To mitigate it you can create snapshots, so you only need to replay the events from the latest snapshot onwards, but it would still be too expensive to query the data for every screen. That is why event sourcing cannot work without CQRS.
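A minimal sketch (hypothetical event shapes) of the replay idea: the current state of an account is derived by applying, on top of the latest snapshot, only the events the snapshot does not yet include.

// Snapshot of the state after events 1..2
const snapshot = { version: 2, balance: 150 }

const events = [
    { version: 1, type: 'Deposited', amount: 100 },
    { version: 2, type: 'Deposited', amount: 50 },
    { version: 3, type: 'Withdrawn', amount: 30 },
    { version: 4, type: 'Deposited', amount: 20 }
]

function currentState(snapshot, events) {
    return events
        .filter(e => e.version > snapshot.version)   // replay only what the snapshot misses
        .reduce((state, e) => ({
            version: e.version,
            balance: e.type === 'Deposited'
                ? state.balance + e.amount
                : state.balance - e.amount
        }), snapshot)
}

console.log(currentState(snapshot, events))   // { version: 4, balance: 140 }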

I will talk about CQRS in a future post, but there is also a small introduction to it in the attached video. If you want to know more about event sourcing and learn from someone far more knowledgeable, here is the video:

Greg Young - CQRS and Event Sourcing - Code on the Beach 2014



Are you using event sourcing in your company? Did you find any problems that were difficult to solve?